Method and System for Creating a Data Profile Engine, Tool Creation Engines and Product Interfaces for Identifying and Analyzing File and Sections of Files

ABSTRACT

A data profile engine identifies, classifies, analyzes, searches, compares and cross-references entire files and sections of files, records and other forms of electronic media, and a tool creation engine in combination with the data profile engine builds custom solutions and product interfaces.

CROSS-REFERENCE TO RELATED APPLICATION

This application is a continuation of U.S. application Ser. No. 12/560,358, filed Sep. 15, 2009, which claims the benefit of U.S. Provisional Application No. 61/097,033, filed Sep. 15, 2008, each of which is herein incorporated by reference in its entirety.

FIELD OF THE INVENTION

A data profile engine for identifying, classifying, analyzing, searching, comparing and cross-referencing entire files and sections of files, records and other forms of electronic media is provided. A tool creation engine is also provided in combination with the data profile engine for building custom solutions and product interfaces.

BACKGROUND

Search and classification systems are measured by their precision (the ability to find only relevant materials) and their recall (the ability to find all relevant documents). However, the variability of language, including synonymy (different words describing the same idea) and polysemy (words having two or more meanings), limits the accuracy of current search engines. The result is that researchers may not find all useful material or may have to review volumes of search results in order to find useful material. Moreover, current search engines are limited in their capacity to accurately find a particular section from a certain type of document.

In addition to finding and classifying materials, a number of industries and service sectors rely upon standard documents and forms to ensure quality and accuracy of their documents. A number of industries have manually constructed and maintained professional document standards and templates. For example, the American Institute of Architects (AIA) and the American Society of Landscape Architects (ASLA) maintain standard building and planning specifications. In the legal field, the International Swaps and Derivatives Association (ISDA) provides standard master documents for complex financing. Templates also exist for project management, software development, and many other professional practices. Currently, in order to generate a standard template, experts must read and review numerous documents, manually create the outline and identify standard text associated with each section and alternative examples of each. Due to the breadth, complexity and expense of the process, standard documents are not broadly available for all professional documents.

SUMMARY

In view of the aspects described above, provided herein is a data profile engine that captures the “digital signature,” or characteristics, of electronic media and their component elements. The textual characteristics of text blocks, together with hierarchical relationship information of the blocks to each other, enable new tools to be created for accurate text search, classification and data mining.

In certain embodiments, a method is provided for identifying a relationship between a plurality of data files comprising word text, which may be implemented in a data profile engine. The method includes receiving at a processor the plurality of data files from one or more computer databases; deconstructing the data files into one or more text blocks; creating a data profile for each data file and for the one or more text blocks associated with each data file, each data profile comprising a statistical signature for a set of data forming the corresponding data file or text block compared to the plurality of data files; and storing the data profiles on the one or more computer databases.

In another implementation, a tool creation engine may be provided, which is based on the data profile engine, and generates document standards and templates. Accordingly, a system for analyzing a document comprising text is provided, and may be considered a template and/or benchmarking tool. In this implementation, a processor is configured to receive the document and perform the steps of deconstructing the document into one or more text blocks; creating a data profile for each of the one or more text blocks, each data profile comprising a statistical signature for a set of data forming the text block; and comparing the data profile for each of the one or more text blocks with a template stored on a computer database, the template comprising data profiles for matching text blocks from a source set of documents; and a user interface is coupled to the processor for displaying an indication of similarity of the document compared to the template, the indication of similarity comprising a statistical measure of frequency for matching text blocks.

In yet another implementation, a system for preparing a document comprising text is provided, which may be configured as a document drafting and reuse tool. In this implementation, a processor is configured to receive the document and perform the steps of: deconstructing the document into one or more text blocks; creating a data profile for each of the one or more text blocks, each data profile comprising a statistical signature of a set of data forming the text block; and comparing the data profile for each of the one or more text blocks with data profiles associated with a model document stored on a computer database, the model document comprising a plurality of statistically similar text blocks from a source set of model documents; and a user interface coupled to the processor is for displaying default standard clauses, alternative clauses and infrequently used clauses based on the source set of model documents.

In another implementation, a system for searching entire files and sections of files provides a processor configured to receive a plurality of documents and perform the steps of deconstructing the plurality of documents into one or more text blocks; creating a data profile for each of the one or more text blocks, each data profile comprising a statistical signature of a set of data forming the text block; and comparing the data profile for each of the one or more text blocks with data profiles associated with a model document stored on a computer database, the model document comprising a plurality of statistically similar text blocks from a source set of model documents. The system additionally includes a user interface coupled to the processor for entering a search, the search comprising search terms, section captions and/or text of a similar section of a user document compared to the model document, as well as a display for displaying the search results.

These and other features and advantages of the present invention will become apparent to those skilled in the art from the following detailed description, wherein illustrative embodiments of the invention are shown and described, including best modes contemplated for carrying out the invention. As will be realized, the invention is capable of modifications in various obvious aspects, all without departing from the spirit and scope of the present invention. Accordingly, the drawings and detailed description are to be regarded as illustrative in nature and not restrictive.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 depicts a method for identifying a relationship between a number of data files including word text.

FIG. 2A depicts text blocks in a sample document according to certain implementations.

FIG. 2B depicts text blocks and sub-text blocks according to certain implementations.

FIG. 3 depicts text block groups according to certain implementations.

FIG. 4 depicts text block group sets provided according to certain implementations.

FIG. 5 depicts a workflow process according to certain implementations.

FIG. 6 depicts a resume structure.

FIG. 7 depicts a template outline view provided according to certain embodiments.

FIG. 8 depicts a template index view.

FIG. 9 depicts a standard template showing the frequency of occurrence of each section of the template.

FIG. 10 depicts a text block outline structure of a document.

DETAILED DESCRIPTION

1.1. Data Profile Engine

The data profile engine analyzes multiple source files (e.g., electronic media including documents and records), identifies sections of text in each file (a “text block”), matches text blocks within a file and across all files by constructing collections of matched text blocks (a “text block group”) resulting in multiple sets of text block groups (a “text block group set”), and builds a statistical signature of each text block, text block group, and text block group set, identifying their common and distinguishing attributes (a “data profile”).

The data profile captures information including related text blocks, text block groups, and text block group sets, captions, alternative captions, document levels, text block sequences, parent-sibling-and-child relationships, words and word weights, word match scores for each text block and its sub-blocks, and its word clusters. An example of a data profile is attached in Exhibit 5. Data profiles are created for the document as a whole and its text blocks. The data profile can then be used to match similar text blocks and organize entire files into a common framework.

FIG. 1 depicts a method 100 for identifying a relationship between a number of data files including word text, which may be implemented by a data profile engine. The method involves receiving at a processor the plurality of data files from one or more computer databases (operation 110). The received data files are deconstructed into one or more text blocks (operation 120). For each data file and text block, a data profile is created, which is a statistical signature for a set of data forming the corresponding data file or text block as compared to the plurality of data files (operation 130).

The data profiles may be stored on a data profile database (operation 140), and may be subsequently used in connection with the tool creation engine.
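Operations 110 through 140 of method 100 can be illustrated with a minimal Python sketch. Everything concrete in it is an assumption made for illustration only: blank lines stand in for the text block formulas, the "statistical signature" is reduced to a per-word count plus the share of received files containing the word, and a dictionary stands in for the data profile database. The function names are hypothetical and do not come from the specification.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def deconstruct(text):
    """Operation 120 (assumed rule): split a file into blocks on blank lines."""
    return [part.strip() for part in re.split(r"\n\s*\n", text) if part.strip()]


def create_profile(words, file_frequency, total_files):
    """Operation 130 (assumed signature): per-word count in this file or
    block, plus the share of all received files that contain the word."""
    counts = Counter(words)
    return {
        word: {"count": count, "file_share": file_frequency[word] / total_files}
        for word, count in counts.items()
    }


def run_method_100(files):
    """files maps a file name to its text (operation 110: receive files)."""
    total_files = len(files)
    file_frequency = Counter()
    for text in files.values():
        file_frequency.update(set(tokenize(text)))

    profile_store = {}  # operation 140: stand-in for the data profile database
    for name, text in files.items():
        # Key (name, None) holds the profile of the file as a whole.
        profile_store[(name, None)] = create_profile(
            tokenize(text), file_frequency, total_files)
        for index, block in enumerate(deconstruct(text)):
            profile_store[(name, index)] = create_profile(
                tokenize(block), file_frequency, total_files)
    return profile_store


sample_files = {
    "a.txt": "Hobbies\n\nArchery and chess",
    "b.txt": "Hobbies\n\nReading",
}
store = run_method_100(sample_files)
print(store[("a.txt", 1)]["archery"])
```

The sketch profiles each file and each block against the same set of received files, which mirrors the "as compared to the plurality of data files" wording of operation 130.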

1.2. Tool Creation Engine and Product Interfaces

The tool creation engine utilizes the data profile engine and its data profiles to facilitate the creation of a wide range of software tools to analyze collections of files or individual files, performing services such as document organization, research, reuse and validation. An example of a tool creation engine is a document template engine that uses the data profiles of a text block group set to construct an aggregated document outline based on the content and structure of all compared documents. The aggregated outline can be applied to user selected documents, or any part of a document, to decompose, classify, extract metadata, benchmark, search, mark up common language, identify important document specific language, identify default provisions, and group results by common attributes. Product interfaces are used to render the results of these tools in a human readable fashion and can be adapted to different user domains (e.g., architecture, banking, resumes, law, or general use).

2. Description of the Embodiments

2.1. Data Profile Engine

The data profile engine operates in either an untrained or unsupervised mode of operation, or it may be configured to utilize information in existing data profiles. In the untrained mode, the engine starts with no pre-existing data and constructs data profiles using configurable settings and formulas. Using the initial settings and formulas, the engine builds information regarding the source set of documents to identify more and more detailed and variant statistical signatures, or data profiles, for text blocks appearing in the source file set and thereby the types of files represented by its domain area. Using this approach, the data profile engine may build and continue to refine knowledge of legal agreements, architecture specifications, resumes or any other type of document.

2.1.1 Core Concepts

(1) Entities

(a) Words (Word Universe)

The word universe index is a word statistical index containing word frequencies, word weights, and statistical information including: minimum value, maximum value, mean value, median value and deviation. It is generated from a large set of sample documents and is used to measure commonality, mean and divergence in a broad set of documents compared to the word occurrences in specific documents, and word occurrences in specific text blocks.

(b) Text Blocks

A text block is a section of words from a file. It may consist of an entire file, many paragraphs, a single paragraph or even a single word. Depending on the type of source documents, text blocks may be organized at a single level or they may be hierarchically nested (i.e., a text block may contain other text blocks), as shown in FIG. 2A.

In a hierarchical domain of documents, text blocks can be grouped in a hierarchy recording their sequence and level. In FIG. 2A, text blocks 1, 2 and 3 are examples of level 1 and incorporate the characteristics of the nested text blocks within their scope. Text blocks 4, 5, and 6 are examples of level 2 text blocks nested within text block 2. Text blocks 7, 8 and 9 are examples of level 3 text blocks nested within text block 5. The sequence of each text block follows the order in the document; the hierarchical level provides information regarding the depth of a particular section. The combination of sequence, level and ancestral trail determines the location of the text block in the document.

In a hierarchical domain of documents, nested text blocks are treated as sub-text blocks of their parent. In the example of FIG. 2A, text blocks 4, 5 and 6 are sub-text blocks of text block 2. Text blocks themselves may be composed of sub-text blocks. For example, a caption heading may be considered a sub-text block of a text block comprising a paragraph, as shown in FIG. 2B. Sub-text blocks may also be composed of hierarchically nested text blocks.

The scope of a text block is defined by program rules (“formulas”) identifying paragraphs, captions, matching words and matching word clusters in other documents.

Depending on the type of source documents, text blocks may be organized at a single level or they may be hierarchically nested (i.e., a text block may contain other text blocks), as shown in FIG. 2B.
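The sequence, level and ancestral trail of nested text blocks can be pictured with a small Python sketch. The `TextBlock` class, its field names and the `build_hierarchy` helper are hypothetical and are not taken from the specification. The document order assumed for FIG. 2A (1, 2, 4, 5, 7, 8, 9, 6, 3) is inferred from the nesting described in the text and is itself an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(eq=False)
class TextBlock:
    """Hypothetical record for one text block; field names are illustrative."""
    block_id: int
    sequence: int  # position in document order, starting at 1
    level: int     # hierarchical depth, 1 = top level
    parent: Optional["TextBlock"] = field(default=None, repr=False)
    children: List["TextBlock"] = field(default_factory=list, repr=False)

    def ancestral_trail(self) -> List[int]:
        """Block ids from the top-level ancestor down to this block."""
        trail = []
        node: Optional[TextBlock] = self
        while node is not None:
            trail.append(node.block_id)
            node = node.parent
        return trail[::-1]


def build_hierarchy(flat: List[Tuple[int, int]]) -> Dict[int, TextBlock]:
    """Build the tree from (block_id, level) pairs listed in document order."""
    blocks: Dict[int, TextBlock] = {}
    open_blocks: List[TextBlock] = []  # chain of currently open ancestors
    for sequence, (block_id, level) in enumerate(flat, start=1):
        block = TextBlock(block_id, sequence, level)
        # Close every open block that is at the same depth or deeper.
        while open_blocks and open_blocks[-1].level >= level:
            open_blocks.pop()
        if open_blocks:
            block.parent = open_blocks[-1]
            open_blocks[-1].children.append(block)
        open_blocks.append(block)
        blocks[block_id] = block
    return blocks


# Assumed document order for the nine text blocks of FIG. 2A.
FIG_2A_ORDER = [(1, 1), (2, 1), (4, 2), (5, 2), (7, 3),
                (8, 3), (9, 3), (6, 2), (3, 1)]
blocks = build_hierarchy(FIG_2A_ORDER)
print(blocks[8].ancestral_trail())  # [2, 5, 8]
```

Storing the parent link alongside sequence and level lets the trail be recovered on demand, so the location of a block can be stated without rescanning the document.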

(c) Text Block Groups

A text block group is a set of matching text blocks sharing sufficient common characteristics as defined by formulas, scores and thresholds. FIG. 3 shows the example of two resume files and finds that text block 1 in file 1 has shared characteristics with text block 2 in file 2. In this example, the two text blocks are grouped together in the set text block group 1, and the statistical characteristics of the text block group are stored as a data profile, capturing the individual and aggregate characteristics of each text block and all related text blocks.

(d) Text Block Group Sets

By analyzing the shared characteristics of each text block group, sets of text block groups are identified. In FIG. 4, a text block captioned “other interests” is found and would likely match similar text blocks in further files and generate a text block group for “other interests”. By comparing the text block groups for “hobbies” and “other interests,” the engine determines that “hobbies” and “other interests” share sufficient common attributes to be treated as a text block group set.

(2) Formulas and Scores

The engine uses configurable formulas to identify and match file elements. The scoring examples shown in this document are illustrative of the methodology applied. In practice, the engine calculates numerous statistical metrics known to one skilled in the art. A complete set of formulas is shown in Exhibit 4 detailing the pseudo-code for text block-to-text block group matching. All other processes mirror this process, but are not detailed in each section for purposes of understanding the overall methodology. A sample configuration file used to set threshold scores is attached as Exhibit 1. A sample output score for each statistical metric is attached as Exhibit 2.

(a) Base Scores (Entity Base Statistics)

Each entity (i.e., words, text blocks, text block groups and text block group sets) is given one or more base scores for its characteristics using configurable formulas.

(b) Match Scores (Comparison Entity-to-Entity)

When comparing one entity to another, the engine computes match scores focusing on a range of characteristics such as distinguishing words, leading words, and caption heading words. Each type of match score is created using configurable formulas to determine the level and scope of the shared characteristics of the entities being compared.

(c) Resolution Scores (Organization Statistics)

In many cases, match scores yield different comparative metrics. Resolution scores, again based on configurable formulas, determine the processing methodology for particular tasks, such as generating a template (see tool creation engines, template generation in relation to FIG. 5) or matching text in a new document to a set of data profiles. Resolution scoring methods include determining the best match, identifying redundant matches, and assessing the sequence and level of text blocks in a domain of files that possess hierarchical attributes.

(3) Data Profile Creation

The statistical signatures of text block groups are known as data profiles. Data profiles are constructed using an iterative process of building base scores, match scores and resolution scores. The process generally follows the sequence of comparing:

-   words-to-words
-   text blocks-to-text blocks
-   text block groups-to-text block groups
-   text blocks-to-text block groups
-   sets of text block groups-to-sets of text block groups

The process, repeated numerous times, builds the statistical metrics used to identify, compare, and organize text blocks in a chosen set of files.

2.1.2 Data Profile Creation Methodology

In FIG. 5, a workflow depicts certain implementations of the data profile creation methodology process in which a data profile engine 500 generates data profiles for use in generating tools 540. A data profile engine 500 provided on a computer system receives a set of files 501, which may be cleaned and converted 502, analyzes the files to capture a statistical signature of the files and sections within files, and performs the steps of capturing the identifying and distinguishing attributes and establishing organizational relationships to other files and sections within such other files.

According to certain implementations, a word universe 503 captures the statistical attributes of words in a comprehensive set of files, which may be the same or similar to the set of files 501, and measures word frequency across the files and within sections of such files.

The data profile engine 500 processes the files using a series of formulas to identify text sections and sub-sections (“text blocks” and “sub-text blocks”) within documents using lexical and formatting attributes of words and files 504. This file processing decomposes the files into text blocks 505. For example, discrete text blocks are identified in each file by comparison to the statistical signatures of text sections created during the processing sequence or drawn from existing data profiles.

Each text block undergoes a matching process 506, which may involve using a series of formulas for computing statistical scores for each text block in each file. For example, a base statistical score of each text block in each file may be compared to generate match scores. Formulas may be applied to the scores for resolving different comparative metrics and determining match or shared characteristics 507. Text blocks are matched within each file and across a set of files.

Matching sets of text blocks sharing common characteristics are grouped into text block groups 508. In certain implementations, redundant or duplicate text block groups are identified and/or eliminated. For each text block group, a data profile is created 509, which is a definition of the statistical signature of the text block group. The data profile may capture a base statistical score, shared word characteristics, file organization characteristics and alternative match results. Text block groups are compared with each other to determine their matching characteristics, and statistically similar text block groups may be merged, and the text block group's data profile may be updated and refined. In some implementations, certain text blocks are initially unmatched to a text block group either because the text block does not match any group at a particular instance as the text block groups are built, or because the text block group has not been processed by the data profile engine. Accordingly, the unmatched text blocks are compared to text block groups defining a base score, a match score and a set of resolution scores and matched 510.

Text block group sets are generated by comparing text block groups to other text block groups, and those that share common characteristics form a set. The text block group set may be analyzed to determine the correct sequence order of each text block. That is, the sets of text block groups sharing common characteristics are compared to determine the correct hierarchical depth of each text block 511. The comparison may involve implementing a set of configurable rules for iterative reprocessing capturing increasing statistical information for identifying matching text sections with greater precision and broader recall.

Data profiles may be stored on a data profile database 520, and information related to the text blocks, text block groups, and sets of text block groups may be provided to tool creation engines 530 in order to create tools 540 for text block utilization. The tool creation engines 530 allow for the creation of programming tools and product interfaces.

The tool creation engine may include a data profile and matching rules tool 541 for analyzing a user document 550 in a manner similar to the data profile engine 500 in which the document 550 is decomposed into text blocks and matched with a text block group.

A document drafting and reuse tool 542 may deconstruct a user document 550 and identify text blocks based on paragraphs, caption headings, and statistical comparison to document template and text block data profiles. In certain implementations, the user document 550 and document sections may be used as data for use in finding alternative matching sections from a set of documents. In addition, the document drafting and reuse tool 542 may identify an exemplar section from the set of matching sections that most closely conforms to a data profile.

The tool creation engine 530 may generate a document validation tool 543, which may be configured for identifying language conformity in particular sections of a document 550 compared to the data profile. In another embodiment, the document validation tool 543 is for identifying common language or key text based on statistical analysis of word clusters found in text blocks. Data may be extracted for identification of document specific information (such as names, places, dates and amounts) applying the statistical comparison of file text blocks to template data profiles and to a word universe. In another embodiment, the document 550 may be analyzed for identifying and displaying standard language and document specific (or negotiated) language. Document 550 may also be analyzed for identifying the default text block from all data profiles as the text block that contains the most common words, containing all required word clusters and the least number of non-standard words. In addition, document 550 may be evaluated and selected sections compared to a program generated prototypical section providing redline information, including standard language, non-standard language and missing language in a redline format, e.g., underlining, bolding or highlighting standard language, striking through non-standard language, and underlining or italicizing missing language.

Meta-data extraction tool 544 may retrieve metadata from user document 550 by using rule-based techniques to capture titles, headings, names, places and dates.

The tool creation engine 530 may generate a template generation tool 545 for automatic generation of document templates and text block data profiles constructed from multiple source documents, e.g., 501, 551, sample files, etc., creating a single, aggregated outline and index of all distinguishing document elements that may be displayed as a hierarchical tree or alphabetical list.

In another embodiment, the template generation tool 545 is for automatic generation of master templates combining individual document type templates into a single universal template, for example, combining all legal agreements into a universal template containing all document elements from all the individual document templates, themselves generated from individual source documents. The master template can be constructed as a market standard, a checklist or universe of all text blocks.

The template generation tool 545 may be configured for automatic generation of model templates or documents for particular document types (such as a merger agreement or landscape planning specification) applying mutually exclusive rules and text block redundancy rules to remove similar text blocks found in different locations in individual documents and determine the most frequent location of each template outline element. A model document or template includes a number of statistically similar text blocks from a source set of model documents.

The template generation tool 545 may also be configured for automatic generation of checklist templates identifying all common variants of each document hierarchical element (above a programmatically or user-defined threshold) including redundant text blocks that may be found in different locations in individual documents.

The template generation tool 545 may generate universal templates identifying all distinguishing sections and text blocks in the analyzed document set, including unique text blocks. Derivative templates may also be created from a small source set of files, applying the data profile information from a larger set of similar files.

Furthermore, templates may be analyzed, for example by identifying the frequency that each text block appears in the template. Templates may be analyzed to identify the closest matching templates for a particular document applying template matching rules.

In addition, a methodology for editing the templates and checklists to generate customized outlines and text block components to capture exemplar practice or the distinct needs of particular document requirements may also be implemented by the template generation tool 545.

Next, the tool creation engine 530 may generate a document and text block classification tool 546 in which user documents 550 are categorized based on text extraction and statistical comparison to document template and text block data profiles. In another example, the document and text block classification tool 546 categorizes topical sections of documents 550 based on document decomposition and statistical comparison to document templates and text block data profiles.

According to certain implementations, the tool creation engine 530 may generate a key language highlighting tool 547 in which a user document 550 is analyzed and the engine highlights standard terms in a document or section of a document, showing standard language and non-standard language. A lawyer, for example, may select a particular clause in a Loan Agreement and highlight and distinguish the standard language of the clause, the deal-specific language and key deal terms (such as party names, jurisdictions, dates, amounts, etc.).

The tool creation engine 530 may generate a document and text block searching tool 548 for finding entire documents using a file 550 as the search pattern by comparing the file 550 to all template profiles and comparing titles (if available), captions, length, and distinguishing text blocks uniquely associated with a given template. In another embodiment, the document and text block searching tool 548 may identify matching sections using any section of a document, e.g., search terms, captions, sections of text or entire documents, as the search pattern, comparing the document fragment and the source file from which it derives to template profiles and data profiles.

The tool creation engine 530 may generate a document and text block benchmarking tool 549 for analyzing similarity and divergence of an entire document 550 compared to a template providing: (a) a statistical measure of frequency for all matching text blocks and (b) a list of all text blocks missing from the selected document compared to the selected templates. In another embodiment, the document and text block benchmarking tool 549 is for analyzing similarity and divergence of selected text blocks or text fragments compared to the template, providing a statistical measure of language commonality and variance compared to a text block data profile.
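The two outputs of the benchmarking tool, (a) a frequency measure for matching text blocks and (b) a list of missing text blocks, can be sketched in Python as follows. The sketch assumes that matching has already been performed upstream and that each text block group is identified by a caption string. The function name, the section names and the frequency values are hypothetical illustrations, not data from the specification.

```python
from typing import Dict, List, Set, Tuple


def benchmark(document_groups: Set[str],
              template_frequency: Dict[str, float]
              ) -> Tuple[Dict[str, float], List[str]]:
    """Compare a document's matched text block groups against a template.

    template_frequency maps each template text block group to the fraction
    of source documents in which that group was found (assumed measure).
    """
    # (a) frequency for every template text block the document matches
    matched = {group: freq for group, freq in template_frequency.items()
               if group in document_groups}
    # (b) template text blocks that are absent from the document
    missing = sorted(group for group in template_frequency
                     if group not in document_groups)
    return matched, missing


# Hypothetical resume template; the frequencies are invented for the example.
RESUME_TEMPLATE = {
    "contact details": 0.98,
    "education": 0.95,
    "work experience": 0.97,
    "hobbies": 0.40,
}
user_document_groups = {"contact details", "education", "hobbies"}

matched, missing = benchmark(user_document_groups, RESUME_TEMPLATE)
print(matched)
print(missing)  # ['work experience']
```

A low frequency on a matched block (here "hobbies") flags an uncommon section, while a high-frequency entry in the missing list flags a likely omission.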

Aspects of data profile creation methodologies are further described in the example below, in which the methodology is applied to create data profiles in an untrained mode of operation using the example of a set of resumes used for employment application. The example illustrates the steps of the processing of a sample of 1,000 resumes and describes how the resulting text block group set is constructed.

The example assumes the general structure of resumes contains the sections listed in FIG. 6. However, notably, the system is not given this information; the data profile engine deduces the structure applying engine rules and logic.

(1) Word and File Analysis

(a) Create Word Universe

The word universe is generated from a large set of sample documents and is used to measure word commonality, mean and divergence in a broad set of documents compared to the word occurrences in specific documents, and word occurrences in specific text blocks. The examples shown illustrate the scoring methodology limited to averages. In addition, scoring applied by the data profile engine calculates and measures other statistical measures, such as mean, mode and deviation.

The word universe statistical score may be built by measuring word occurrences in entire files or text blocks within such files. For purposes of measuring word occurrences in text blocks, an identification formula using special characters, such as carriage returns and line feeds, is used. By measuring word frequency in text blocks, the word universe can determine whether words appear in distinct sections of files.

In addition to single word occurrences in files and text blocks, the word universe also detects word pairs, word triplets, etc. Table 1 shows an example of statistical measures for word occurrences in files.

(i) Formulas and Scores (Base Score)

TABLE 1
Word Universe Fragment
Total Documents: 84,472

Word        Score 1    Score 2      Score 3    Score 4    Score 5    Score 6
A           80,249     2,487,719    0.950      29.450     31         1.053
AA          9,292      204,424      0.110      2.420      22         9.091
AARON       845        11,830       0.010      0.140      14         99.967
ABA         16,050     288,900      0.190      3.420      18         5.263
ABACK       17,740     53,220       0.210      0.630      3          4.762
ABAMYNET    761        9,893        0.009      0.117      13         111.001
ABANNDON    12,671     38,013       0.150      0.450      3          6.667
ABASE       2,535      5,070        0.030      0.060      2          33.322
ABATE       11,827     82,789       0.140      0.980      7          7.142
ABBA        169        3,042        0.002      0.036      18         499.834
. . .       . . .      . . .        . . .      . . .      . . .      . . .

Score 1: Word occurred in files
Score 2: Word occurrences total
Score 3: Word frequency in all files (Score 1/Total Documents)
Score 4: Word average in all files (Score 2/Total Documents)
Score 5: Average occurrences per file in documents containing the word (Score 2/Score 1)
Score 6: Word frequency per document (Score 5/Score 4)

The word universe identifies important attributes of each word. For example, the words “AARON”, “ABAMYNET”, and “ABBA” are very rare across the documents (Score 3), but when they do occur, these words occur frequently (Score 6). Such high frequency of words in a file compared to their average frequency in all files indicates that these words are key terms in the file in which they appear. In contrast, words such as “A” occur at approximately the same level of frequency in individual files and all files, indicating that such words are common and non-distinguishing.
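The six scores of Table 1 can be computed with a short Python sketch. The tokenizer (alphabetic runs, uppercased) and the function name are assumptions for illustration. Score 5 is computed here as Score 2 divided by Score 1, which is the ratio the tabulated values are consistent with (for example, 204,424 / 9,292 = 22 for “AA”).

```python
import re
from collections import Counter
from typing import Dict, List, NamedTuple


class WordScores(NamedTuple):
    score1: int    # number of files in which the word occurred
    score2: int    # total occurrences of the word
    score3: float  # Score 1 / total documents
    score4: float  # Score 2 / total documents
    score5: float  # Score 2 / Score 1 (average per file containing the word)
    score6: float  # Score 5 / Score 4


def build_word_universe(documents: List[str]) -> Dict[str, WordScores]:
    """Compute Table 1 style scores for every word in the documents."""
    total_documents = len(documents)
    files_with_word: Counter = Counter()
    total_occurrences: Counter = Counter()
    for text in documents:
        words = re.findall(r"[A-Z]+", text.upper())  # assumed tokenizer
        total_occurrences.update(words)
        files_with_word.update(set(words))  # count each file once per word

    universe = {}
    for word, score1 in files_with_word.items():
        score2 = total_occurrences[word]
        score3 = score1 / total_documents
        score4 = score2 / total_documents
        score5 = score2 / score1
        score6 = score5 / score4
        universe[word] = WordScores(score1, score2, score3,
                                    score4, score5, score6)
    return universe


sample_documents = [
    "a cat and a dog",
    "a bird",
    "the cat cat cat",
    "the end",
]
universe = build_word_universe(sample_documents)
print(universe["CAT"])
```

Under these definitions Score 6 simplifies algebraically to Total Documents / Score 1, the reciprocal of Score 3, which is why rare words such as “ABBA” receive the largest Score 6 values.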

(b) Text Blocks and Text Block-to-Text Block Comparison

(i) Identify Text Blocks within a File

The identification of text blocks uses configurable rule sets to identify demarcation characteristics and decomposes input documents into text blocks. Depending on the type of documents processed, the rule set may incorporate the following formulas:

Special Characters:

Any set of words delimited by the special characters “\r\n” (carriage return and line feed).

Captioned Sections:

Any set of words marked with an opening caption, where the caption is identified by sufficient elements of: (i) a caption numbering prefix, (ii) a caption title (identified by formatting characteristics such as uppercase, initial capitals, underline, bold, font changes, etc.), and (iii) a postfix (typically a colon, semicolon, or comma).
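The two rule types above can be sketched as follows. This is an illustration under stated assumptions, not the engine's rule set: the function names and the regular expression are hypothetical, and because plain text carries no bold or underline information, only an optional numbering prefix, an uppercase title and a postfix character are checked.

```python
import re

# Hypothetical caption rule: optional numbering prefix (e.g. "2.1"),
# an uppercase title, then a postfix of colon, semicolon or comma.
CAPTION = re.compile(
    r"^\s*(?:(\d+(?:\.\d+)*)[.)]?\s+)?"  # (i) optional caption numbering prefix
    r"([A-Z][A-Z &'-]*[A-Z])"            # (ii) caption title in uppercase
    r"\s*([:;,])"                        # (iii) postfix
)


def split_text_blocks(text):
    """Special-character rule: blocks are delimited by carriage return + line feed."""
    return [block.strip() for block in text.split("\r\n") if block.strip()]


def find_caption(block):
    """Return the caption title of a block, or None if no caption is detected."""
    match = CAPTION.match(block)
    return match.group(2) if match else None


sample = (
    "HOBBIES: Jack's leisure activities include riding horses.\r\n"
    "\r\n"
    "2.1 WORK EXPERIENCE: Five years in retail.\r\n"
    "No caption here."
)
blocks = split_text_blocks(sample)
captions = [find_caption(block) for block in blocks]
```

A production rule set would likely weigh several caption signals together ("sufficient elements"), whereas this sketch requires the title and postfix and treats only the prefix as optional.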

(ii) Formulas for Scoring Text Blocks Using the Word Universe (Base Score)

Each text block is attributed a base score calculated as the sum of the word weights for each word in the text block. The word weights may be carried forward from the word universe or recomputed by comparing word frequency in files to word frequency in text blocks.

The calculation of the text block base score is illustrated using the text in FIG. 3. The sample text shows two text blocks taken from two different resume files. Table 2 shows the calculation for text block 1 from the first resume file.

TABLE 2 Text Block 1 - Base Score

Text Block 1
Word          Score 1    Score 2
HOBBIES       0.1        1.00
IN            0.88       0.10
HIS           0.75       0.10
LEISURE       0.21       1.00
TIME          0.23       1.00
JACOB         0.06       0.50
TEGGILL       0.002      0.00
ENJOYS        0.22       1.00
THE           0.91       0.10
ACTIVITIES    0.5        1.00
OF            0.74       0.10
ARCHERY       0.03       0.50
AND           0.96       0.10
BASEBALL      0.05       0.50
AND           0.96       0.10
OFTEN         0.66       1.00
RELAXES       0.15       1.00
BY            0.72       0.10
DRAWING       0.07       0.50
OR            0.95       0.10
READING       0.08       0.50
A             0.95       0.10
BOOK          0.11       1.00
Score 3                  11.40

Score 1: Word Commonality
Score 2: Word Weight
Score 3: Text Block Base Score

Word Commonality (Score 1):

calculates the percentage of all files in which each word occurs, applying Formula 1.

Text Block Formula 1:

word commonality=number of files containing a word/total number of files in the word universe

In the example, the base commonality scores for text block 1 are shown in Table 2, displaying the word frequencies for each word in the text block. It will be understood that the formulas shown are illustrative of the methodology used and not comprehensive of all formulas used in the template engine.

Word Weight (Score 2): calculates the word weights applying an adaptable formula.

Text Block Formula 2:

-   Where the percentage of files in which the word occurs is greater than 70%, the word weight assigned is 0.1 (e.g., high frequency words are weighted relatively lower because they are non-distinguishing and include “noise” words such as “and”, “of”, “the”, etc.)
-   Where the percentage of files in which the word occurs is greater than 10% and less than 70%, the word weight assigned is 1.0
-   Where the percentage of files in which the word occurs is greater than 0.01% and less than 10%, the word weight assigned is 0.5
-   Where the percentage of files in which the word occurs is less than 0.01%, the word weight assigned is 0.0 (e.g., low frequency words are typically proper nouns (such as names and places), dates, amounts and misspellings)

The Word Weight Scores for Text Block 1 are shown in Table 2.
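The tiered weighting can be expressed as a small function. This is a sketch only: the name `word_weight` and its parameters are hypothetical, commonality is passed as a fraction rather than a percentage, the lowest tier is read as words occurring in fewer than 0.01% of files, and the treatment of values that fall exactly on a boundary (here, 10% is given weight 1.0, as for “HOBBIES” in Table 2) is an assumption. Because the formula is described as adaptable, the thresholds are left configurable.

```python
def word_weight(commonality, high=0.70, mid=0.10, low=0.0001):
    """Map a word commonality (fraction of files containing the word) to a weight.

    The thresholds default to the tiers listed in Text Block Formula 2,
    expressed as fractions; they are parameters because the formula is adaptable.
    """
    if commonality > high:
        return 0.1  # noise words such as "and", "of", "the"
    if commonality >= mid:
        return 1.0  # distinguishing vocabulary
    if commonality > low:
        return 0.5  # less common words
    return 0.0      # very rare words: names, dates, amounts, misspellings


weights = {
    "AND": word_weight(0.96),
    "LEISURE": word_weight(0.21),
    "JACOB": word_weight(0.06),
}
```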

Text Block Base Score (Score 3):

calculates the total of all word weights in the text block.

Text Block Formula 3:

Text Block Base Score=Sum (Base Word Weights in Text Block)
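As a worked check of Formula 3, the sketch below sums the word weights of text block 1, taking the Score 2 values directly from Table 2. The variable and function names are illustrative assumptions.

```python
# (word, word weight) pairs for text block 1, copied from Table 2 (Score 2).
TEXT_BLOCK_1 = [
    ("HOBBIES", 1.00), ("IN", 0.10), ("HIS", 0.10), ("LEISURE", 1.00),
    ("TIME", 1.00), ("JACOB", 0.50), ("TEGGILL", 0.00), ("ENJOYS", 1.00),
    ("THE", 0.10), ("ACTIVITIES", 1.00), ("OF", 0.10), ("ARCHERY", 0.50),
    ("AND", 0.10), ("BASEBALL", 0.50), ("AND", 0.10), ("OFTEN", 1.00),
    ("RELAXES", 1.00), ("BY", 0.10), ("DRAWING", 0.50), ("OR", 0.10),
    ("READING", 0.50), ("A", 0.10), ("BOOK", 1.00),
]


def base_score(weighted_words):
    """Text Block Formula 3: sum of the word weights in the text block."""
    return sum(weight for _, weight in weighted_words)


score_3 = round(base_score(TEXT_BLOCK_1), 2)
```

Note that every occurrence of a word contributes to the sum, so “AND” is counted twice here, which is how the total of 11.40 in Table 2 is reached.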

The same methodology is applied to other text blocks to compute their base score. Table 3 shows the base score calculation for text block 2 from the second resume file.

(iii) Matching Two Text Blocks (Match Scores)

After generating text block scores for each file, text blocks in each file are compared to text blocks in every other file to measure the degree of commonality of match. The example shows the text block match scores for text block 1 and text block 2.

TABLE 3 Text Block 2 - Base Score

Text Block 2
Word          Score 1    Score 2
HOBBIES       0.1        1
JACK'S        0.06       0.5
LEISURE       0.21       1
ACTIVITIES    0.5        1
INCLUDE       0.34       1
RIDING        0.12       1
HORSES        0.06       0.5
SKATING       0.04       0.5
AND           0.96       0.1
SPENDING      0.28       1
TIME          0.23       1
WITH          0.76       0.1
HIS           0.75       0.1
FAMILY        0.12       1
Score 3                  9.8

Score 1: Commonality Documents
Score 2: Word Weight
Score 3: Text Block Base Score

The match scoring methodology compares any two text blocks and applies a number of different statistical measures to assess similarity and difference, each of which may generate a different score. Examples of the formulas include:

-   Calculating the sum and word score for all words in each text block
-   Calculating the sum and match score for matching words in each text block
-   Calculating the sum and match score for high weighted words in each text block
-   Calculating the sum and score for matching words near the beginning of each text block
-   Calculating the sum and score for matching captioned words in each text block

Other methods used, but not illustrated, include calculating scores for word pairs, word triplets and word clusters. The data profile engine applies a number of different matching formulas that can be configured depending on the domain set of documents analyzed.

The example in Table 4 shows the match results for comparing text block 1 and text block 2. Matching or common words appearing in both text blocks are bolded and italicized.

TABLE 4 Text Block Comparison

(* marks a matching word, i.e., a word appearing in both text blocks)

Text Block 1
Word            Score 1    Score 2    Score 4
A               95%        0.10
ACTIVITIES *    50%        1.00       1.00
AND *           96%        0.10       0.10
ARCHERY         3%         0.50
BASEBALL        5%         0.50
BOOK            11%        1.00
BY              72%        0.10
DRAWING         7%         0.50
ENJOYS          22%        1.00
HIS *           75%        0.10       0.10
HOBBIES *       10%        1.00       1.00
IN              88%        0.10
JACOB           6%         0.50
LEISURE *       21%        1.00       1.00
OF              74%        0.10
OFTEN           66%        1.00
OR              95%        0.10
READING         8%         0.50
RELAXES         15%        1.00
TEGGILL         0%         0.00
THE             91%        0.10
TIME *          23%        1.00       1.00

Score 3    11.3
Score 5    4.2
Score 6    0.37
Score 7    9.0
Score 8    6.0
Score 9    0.7

Text Block 2
Word            Score 1    Score 2    Score 4
ACTIVITIES *    50%        1.00       1.00
AND *           96%        0.10       0.10
FAMILY          12%        1.00
HIS *           75%        0.10       0.10
HOBBIES *       10%        1.00       1.00
HORSES          6%         0.50
INCLUDE         34%        1.00
JACKS           6%         0.50
LEISURE *       21%        1.00       1.00
RIDING          12%        1.00
SKATING         4%         0.50
SPENDING        28%        1.00
TIME *          23%        1.00       1.00
WITH            76%        0.10

Score 3    8.8
Score 5    4.2
Score 6    0.48
Score 7    8.0
Score 8    6.0
Score 9    0.8

Score 1: Word Commonality
Score 2: Word Weight
Score 3: Word Weight Sum
Score 4: All Words Weight Sum
Score 5: Match Word Weight Sum
Score 6: Text Block Matching Words Base Score (Score 5/Score 3)
Score 7: High Weight Words Sum (where word weight = 1)
Score 8: High Weight Words Match Sum
Score 9: Text Block High Weight Words Base Score (Score 8/Score 7)

Matching Word Formulas:

Matching words in the text block examples are italicized.

Text Block Formula 1 (Score 5):

Word Weight Match Sum=Sum (Word Weight Scores for all matching words)

Text Block Formula 2 (Score 6):

Text Block Base Score (based on matching words)=Word Weight Match Sum/Total Word Weight Score
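The two matching word formulas can be traced with the Table 4 data. The sketch below is illustrative: the names are hypothetical, the word weights for text block 1 are copied from Table 4, and the total word weight of text block 2 (8.8) is taken as printed in Table 4 rather than recomputed.

```python
# Word weights for text block 1 (Table 4, Score 2), one entry per distinct word.
BLOCK_1 = {
    "A": 0.1, "ACTIVITIES": 1.0, "AND": 0.1, "ARCHERY": 0.5, "BASEBALL": 0.5,
    "BOOK": 1.0, "BY": 0.1, "DRAWING": 0.5, "ENJOYS": 1.0, "HIS": 0.1,
    "HOBBIES": 1.0, "IN": 0.1, "JACOB": 0.5, "LEISURE": 1.0, "OF": 0.1,
    "OFTEN": 1.0, "OR": 0.1, "READING": 0.5, "RELAXES": 1.0, "TEGGILL": 0.0,
    "THE": 0.1, "TIME": 1.0,
}
# Distinct words of text block 2 (Table 4).
BLOCK_2_WORDS = {
    "ACTIVITIES", "AND", "FAMILY", "HIS", "HOBBIES", "HORSES", "INCLUDE",
    "JACKS", "LEISURE", "RIDING", "SKATING", "SPENDING", "TIME", "WITH",
}


def match_word_weight_sum(weights_a, words_b):
    """Score 5: sum of the weights of the words that appear in both blocks."""
    return sum(weight for word, weight in weights_a.items() if word in words_b)


score_5 = match_word_weight_sum(BLOCK_1, BLOCK_2_WORDS)
score_6_block_1 = score_5 / sum(BLOCK_1.values())  # Score 5 / Score 3
score_6_block_2 = score_5 / 8.8                    # Score 3 of block 2 as printed
```

The match sum is the same for both blocks, but each block divides it by its own total, so the shorter block 2 reports the higher matching score.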

High Word Weight Formulas

Text Block 1
HOBBIES: In his leisure time, Jacob Teggill enjoys the activities of Archery and Baseball and often relaxes by drawing or reading a book.

Text Block 2
HOBBIES: Jack's leisure activities include riding horses, skating, and spending time with his family.

High word weight formulas focus on the most distinguishing words in the text blocks. In the example, the high weight words are italicized. The resulting formulas yield a score of 0.67 for Text Block 1 and 0.75 for Text Block 2.

Leading Words Formulas

Text Block 1
[HOBBIES: In his leisure] time, Jacob Teggill enjoys the activities of Archery and Baseball and often relaxes by drawing or reading a book.

Text Block 2
[HOBBIES: Jack's leisure] activities include riding horses, skating, and spending time with his family.

Key topical features of a text block are frequently found at the start, as authors typically draft from general to specific. Applying this circumstance, scores can be calculated based on a defined number of leading words in each text block. In the example, the first 4 words are used to generate a leading word score. The formula definitions apply the same methodology as used in the high word weight formulas. The resulting formulas yield a score of 1.0 for both text block 1 and text block 2.

Caption Words Formulas

Text Block 1
[HOBBIES]: In his leisure time, Jacob Teggill enjoys the activities of Archery and Baseball and often relaxes by drawing or reading a book.

Text Block 2
[HOBBIES]: Jack's leisure activities include riding horses, skating, and spending time with his family.

Section captions also provide valuable information about the topical content of the text block. The formula definitions apply the same methodology as used in the high word weight formulas. The resulting formulas yield a score of 1.0 for both text block 1 and text block 2. As detailed in later sections, the data profile engine stores the caption word sequence and its scores as a sub-text block.

(iv) Resolution Scores

The text block matching formulas provide different comparison metrics for each text block. Formulas and rules can then be applied to solve particular tasks. For example, if the task is to identify whether two text blocks match, a text block match rule may be applied. Using the example, some of the potential match scores are shown in Table 5.

TABLE 5 Match Scores Based on Different Formulas

                                        Text Block 1    Text Block 2
Matching Words (Score 6)                0.37            0.48
High Weight Words (Score 9)             0.67            0.75
Sub-Text Block 1 -- Leading Words (*)   1               1
Sub-Text Block 2 -- Caption Words (*)   1               1

(*) Calculation details not shown; based on prior examples

Text Block Match Rule Examples:

Text Block Resolution Formula 1:

-   If Text Block 1 Score 6 and Text Block 2 Score 6 are both greater than 0.4, then it is a match.

This rule identifies that a significant number of the distinguishing words matched.

Text Block Resolution Formula 2:

-   If Text Block 1 Score 6 and Text Block 2 Score 6 are both greater than 0.2 and Text Block 1 Leading Word Score 6 and Text Block 2 Leading Word Score 6 are greater than 0.8, then it is a match.

This rule identifies that some of the distinguishing words matched and the leading words indicated a strong match.
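The two resolution formulas can be written as boolean rules and applied to the Table 5 scores. This is a minimal sketch with hypothetical function and parameter names; the thresholds default to the values stated in the formulas.

```python
def resolution_rule_1(score6_a, score6_b, threshold=0.4):
    """Match when both matching-word scores exceed the threshold."""
    return score6_a > threshold and score6_b > threshold


def resolution_rule_2(score6_a, score6_b, leading_a, leading_b,
                      match_min=0.2, leading_min=0.8):
    """Match when both matching-word scores clear a lower bar and the
    leading word scores of both blocks indicate a strong match."""
    return (score6_a > match_min and score6_b > match_min
            and leading_a > leading_min and leading_b > leading_min)


# Scores for text block 1 and text block 2, as listed in Table 5.
rule_1 = resolution_rule_1(0.37, 0.48)
rule_2 = resolution_rule_2(0.37, 0.48, 1.0, 1.0)
```

With the Table 5 values, the first rule does not fire because text block 1 scores 0.37, whereas the second rule does, since both leading word scores are 1.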

(2) Data Profile Creation

The data profile engine uses a multi-pass methodology to build data profiles, iterating through all or some of the steps described in each pass. The order of the steps outlined will vary, depending on configuration options and the nature of the document domain. In some cases they may be less linear. The steps outlined below show the typical composition of the first pass through the process.

(a) Text Block-to-Text Block Comparison

(i) Identify Text Blocks

The identification of the scope of text blocks is refined during the iterative procedure. In the initial identification of text blocks described above, the data profile engine uses lexical and formatting characteristics to identify text blocks. As more detailed information about the text blocks is constructed, data profile matching is used to identify and demarcate text blocks. In addition, data profile information from prior processing can be used, which is described further in the bootstrapping example on document decomposition.

(ii) Match Text Blocks Across Files

As the data profile engine builds information about each text block, its characteristics and its organizational relationships to other text blocks, the engine is tuned for strict matching. In other words, the initial passes are optimized for greater precision to find only those text blocks that are strong matches. In the current example, text block matching is performed between different files, but text matching may also be applied within files.

The results of the matching process create text block groups, which are sets of text blocks sharing similar characteristics as identified by formulas.

(b) Text Block Group-to-Text Block Group Comparison

(i) Create Initial Text Block Groups

In the current example, initial sets of matching text blocks are identified by a sub-text block caption word formula, described in 2.1.1(1)(b)(iii), in order to generate high precision matches.

Text Block Group Match Formula:

-   Where Text Block 1 Caption Word Score is greater than 0.9 and Text Block 2 Caption Word Score is greater than 0.9, combine the Text Blocks as an initial Text Block Group.

Where 1,000 resumes are processed, the results are shown in Table 6, which lists the text block groups and the number of matching text blocks in each group. In addition, many text blocks did not match.

TABLE 6 Initial Text Group Results

Text Block Group          Number of Text Blocks
Objective                 950
Experience                920
Education                 870
Clubs and Affiliations    210
Hobbies                   190
Other Interests           150
Contact                   850
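One way to picture the high-precision first grouping pass is shown below. This sketch makes simplifying assumptions that are not taken from the description: the caption word score is approximated by the unweighted share of caption words found in the other caption, and all names and sample data are hypothetical.

```python
def caption_score(caption_a, caption_b):
    """Share of caption_a's words that also occur in caption_b (unweighted stand-in)."""
    words_a = set(caption_a.upper().split())
    words_b = set(caption_b.upper().split())
    return len(words_a & words_b) / len(words_a) if words_a else 0.0


def initial_groups(blocks, threshold=0.9):
    """Group (caption, text) pairs whose caption scores exceed the threshold both ways."""
    groups = []  # list of (group caption, member texts)
    for caption, text in blocks:
        for group_caption, members in groups:
            if (caption_score(caption, group_caption) > threshold
                    and caption_score(group_caption, caption) > threshold):
                members.append(text)
                break
        else:
            # No existing group matched strictly enough: start a new group.
            groups.append((caption, [text]))
    return groups


# Hypothetical captioned text blocks from several resumes.
blocks = [
    ("Hobbies", "t1"),
    ("HOBBIES", "t2"),
    ("Other Interests", "t3"),
    ("Education", "t4"),
]
groups = initial_groups(blocks)
```

Because the threshold is strict, “Hobbies” and “Other Interests” start out as separate groups here, which is consistent with the later redundancy and merge steps being needed to combine them.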

(ii) Redundancy Check

Redundancy rules eliminate duplicate variants. In many cases, individual documents may use a different caption heading to describe the same or similar concept. In the example, “Hobbies” and “Other Interests” likely capture a similar subject matter. Analysis of the files shows that a number of different files have a category for “Hobbies” or “Other Interests,” but very few have both. The redundancy rules define the metrics for identifying duplicate variants and the least occurring forms.

TABLE 7 Text Group Comparison - Calculating Redundancy

Text Block Group          Text Block Group          Score 1    Score 2
Objective                 Experience                805        0.875
Objective                 Education                 786        0.903
Objective                 Clubs and Affiliations    155        0.738
Objective                 Hobbies                   97         0.511
Objective                 Other Interests           107        0.713
Objective                 Contact                   737        0.867
Experience                Education                 834        0.959
Experience                Clubs and Affiliations    175        0.833
Experience                Hobbies                   116        0.611
Experience                Other Interests           58         0.387
Experience                Contact                   757        0.891
Education                 Clubs and Affiliations    184        0.876
Education                 Hobbies                   126        0.663
Education                 Other Interests           116        0.773
Education                 Contact                   747        0.879
Clubs and Affiliations    Hobbies                   136        0.716
Clubs and Affiliations    Other Interests           107        0.713
Clubs and Affiliations    Contact                   737        0.867
Hobbies                   Other Interests           2          0.013
Hobbies                   Contact                   189        0.995
Other Interests           Contact                   135        0.900

Score 1: Count of files containing both Text Block Groups
Score 2: Redundant Count/Total Number of Files (%)

The sample formulas calculate the following scores:

Redundancy Formula 1 (Score 1):

Redundant Count=Count of files containing both Text Block Groups

Redundancy Formula 2 (Score 2):

Redundant Score=Redundant Count/Total Number of Files
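The two redundancy formulas amount to a set intersection followed by a division. The sketch below uses hypothetical file identifiers and a hypothetical function name; it implements the formulas as stated and does not attempt to reproduce the figures of Table 7.

```python
def redundancy(files_with_a, files_with_b, total_files):
    """Return (Redundant Count, Redundant Score) for two text block groups.

    files_with_a / files_with_b: sets of identifiers of the files that contain
    a member of each group (assumed representation).
    """
    redundant_count = len(files_with_a & files_with_b)  # Redundancy Formula 1
    redundant_score = redundant_count / total_files     # Redundancy Formula 2
    return redundant_count, redundant_score


# Hypothetical example: ten files, two groups that co-occur in only one file.
hobbies_files = {1, 2, 3}
other_interests_files = {3, 4}
count, score = redundancy(hobbies_files, other_interests_files, total_files=10)
```

A low score for two groups with similar content suggests they are alternative headings for the same section rather than sections that normally appear together.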

The redundant scoring detects that “Hobbies” and “Other Interests” appear in the same resume very infrequently. Where the word commonality score further indicates that the sections cover the same topic, the text block group “Hobbies” is retained because it appears in more files than “Other Interests”.

The methodology is fine-tuned using additional match scores, including high weight words, leading words, captioned words, and word clusters.

(iii) Create Initial Text Block Group Profile

TABLE 8 Text Block Group Data Profile

Text Block Group 1
Number of Files: 1,000
Number of Text Block Groups: 6

Word           Score 1    Score 2    Score 3    Score 4    Score 5    Score 6    Score 7
ACTIVITIES     1          550        0.55       2          0.33       0.67       0.37
HOBBIES        1          360        0.36       1          0.17       0.83       0.30
INTERESTS      1          412        0.41       2          0.33       0.67       0.27
LEISURE        1          244        0.24       1          0.17       0.83       0.20
TIME           1          265        0.27       2          0.33       0.67       0.18
FAVORITE       1          212        0.21       1          0.17       0.83       0.18
ENJOYS         1          242        0.24       2          0.33       0.67       0.16
THING          1          312        0.31       3          0.50       0.50       0.16
SPENDING       1          308        0.31       3          0.50       0.50       0.15
RELAXES        1          165        0.17       1          0.17       0.83       0.14
INCLUDE        1          374        0.37       4          0.67       0.33       0.12
SECOND         1          363        0.36       4          0.67       0.33       0.12
FREE           1          198        0.20       3          0.50       0.50       0.10
EXPLORING      1          143        0.14       2          0.33       0.67       0.10
FAMILY         1          132        0.13       2          0.33       0.67       0.09
RIDING         1          132        0.13       2          0.33       0.67       0.09
DISTANT        1          242        0.24       4          0.67       0.33       0.08
BOOK           1          121        0.12       2          0.33       0.67       0.08
FRONT          1          220        0.22       4          0.67       0.33       0.07
ALLOWS         1          330        0.33       5          0.83       0.17       0.06
PLAYING        0.5        77         0.08       1          0.17       0.83       0.03
TV             0.5        77         0.08       1          0.17       0.83       0.03
READING        0.5        88         0.09       2          0.33       0.67       0.03
WATCHING       0.5        88         0.09       2          0.33       0.67       0.03
BIRD           0.5        66         0.07       1          0.17       0.83       0.03
HORSES         0.5        66         0.07       1          0.17       0.83       0.03
PLAYSTATION    0.5        66         0.07       1          0.17       0.83       0.03
DRAWING        0.5        77         0.08       2          0.33       0.67       0.03
BASEBALL       0.5        55         0.06       1          0.17       0.83       0.02
GARDENING      0.5        55         0.06       1          0.17       0.83       0.02
WRESTLING      0.5        55         0.06       1          0.17       0.83       0.02
FOOTBALL       0.5        44         0.04       1          0.17       0.83       0.02
SKATING        0.5        44         0.04       1          0.17       0.83       0.02
ARCHERY        0.5        33         0.03       1          0.17       0.83       0.01
JACK'S         0.5        66         0.07       4          0.67       0.33       0.01
JACOB          0.5        66         0.07       4          0.67       0.33       0.01
JOSHUA'S       0.5        55         0.06       4          0.67       0.33       0.01
TREVOR'S       0.5        44         0.04       4          0.67       0.33       0.01
OFTEN          1          726        0.73       6          1.00       0.00       0.00
OTHER          1          363        0.36       6          1.00       0.00       0.00
Score 8                                                                          3.40

Score 1: Word Weight
Score 2: Number of files containing the word
Score 3: File occurrence percentage (Score 2/Number of Files)
Score 4: Number of different Text Block Groups containing the word
Score 5: Text Block Group percentage (Score 4/Number of Text Block Groups)
Score 6: One minus Text Block Group percentage (1 − Score 5)
Score 7: Profile Word Weight (Score 1 * Score 3 * Score 6)
Score 8: Text Block Group Base Score (Sum Score 7)

Data Profile scores for each Text Block Group are calculated using the example formulas.

Data Profile Formula 1 (Score 1):

Word Weights=Word Universe Word Weight

Data Profile Formula 2 (Score 2):

-   Number of files containing the word

Data Profile Formula 3 (Score 3):

Percentage of files containing the word=(Score 2/Number of Files)*100

Data Profile Formula 4 (Score 4):

-   Number of different Text Block Groups containing the word

Data Profile Formula 5 (Score 5):

Percentage of Text Block Groups containing the Text Block Common Words=(Score 4/Number of Text Block Groups)*100

Data Profile Formula 6 (Score 6):

Distinguishing Word Factor=Inverse Factor (1−Score 5)¹

¹ The Inverse Factor is computed because it gives more weight to words that appear less frequently across the Text Block Groups

Data Profile Formula 7 (Score 7):

Word Profile Weight Score, calculating the relative value of the word as a component of the Text Block Group=Score 1*Score 3*Score 6

Data Profile Formula 8 (Score 8):

Data Profile Score=Sum of all Word Profile Weight Scores
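The chain from Score 1 to Score 7 can be followed for individual words of Table 8. The sketch below is illustrative and uses hypothetical names; Score 3 and Score 5 are kept as fractions (as they are tabulated in Table 8) rather than multiplied by 100.

```python
def profile_word_weight(word_weight, files_with_word, number_of_files,
                        groups_with_word, number_of_groups):
    """Score 7: relative value of a word within a text block group profile."""
    score_3 = files_with_word / number_of_files    # share of files containing the word
    score_5 = groups_with_word / number_of_groups  # share of groups containing the word
    score_6 = 1 - score_5                          # distinguishing word factor
    return word_weight * score_3 * score_6


def data_profile_score(word_weights):
    """Score 8: sum of all word profile weight scores."""
    return sum(word_weights)


# Inputs taken from Table 8 (1,000 files, 6 text block groups).
activities = profile_word_weight(1.0, 550, 1000, 2, 6)
hobbies = profile_word_weight(1.0, 360, 1000, 1, 6)
often = profile_word_weight(1.0, 726, 1000, 6, 6)
```

“OFTEN” occurs in many files, but because it appears in all six text block groups its distinguishing factor is zero, so it contributes nothing to the profile of any one group.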

Examples of other more advanced formulas include:

-   Squaring Score 6 and Score 3 to highlight the distinguishing characteristics of text block group comparisons.
-   Calculating the percentage of text block group members that a word occurs in and only saving words that occur in 50% or more of the text block group members.
-   Also see examples in Section 2.2.1(3) Bootstrapping Example - Text Block Group-to-Text Block Group Comparison

(iv) Profile-to-Profile Comparison

After generating data profile scores for each set of matching text blocks, each data profile is compared to all other data profiles to measure the degree of shared characteristics.

The data profile engine applies a number of different matching formulas that can be configured depending on the domain set of documents analyzed. The final data profile comparison score indicates the comparative match level between the two data profiles.

TABLE 9 Text Block Group Profile Matching

Hobbies                                Other Interests
Word           Score 1    Score 2      Word          Score 1    Score 2
ACTIVITIES     0.367      0.367        INTERESTS     0.361      0.361
HOBBIES        0.300      0.300        HOBBIES       0.295      0.295
INTERESTS      0.275      0.275        ACTIVITIES    0.291      0.291
LEISURE        0.203      0.203        LEISURE       0.221      0.221
TIME           0.177      0.177        WORKING       0.177      0.000
FAVORITE       0.177      0.177        ENJOYS        0.177      0.177
ENJOYS         0.161      0.161        TIME          0.161      0.161
THING          0.156      0.000        FREE          0.156      0.156
SPENDING       0.154      0.154        FAMILY        0.156      0.156
RELAXES        0.138      0.138        SECOND        0.154      0.154
INCLUDE        0.125      0.125        RELAXES       0.138      0.138
SECOND         0.121      0.121        SPENDING      0.125      0.125
FREE           0.099      0.099        INCLUDE       0.121      0.121
EXPLORING      0.095      0.000        FAVORITE      0.095      0.095
FAMILY         0.088      0.088        FRONT         0.088      0.088
RIDING         0.088      0.088        FOOTBALL      0.088      0.088
DISTANT        0.081      0.081        SPELUNKING    0.086      0.000
BOOK           0.081      0.081        DISTANT       0.081      0.081
FRONT          0.073      0.073        BOOK          0.081      0.081
ALLOWS         0.055      0.055        ALLOWS        0.073      0.073
PLAYING        0.032      0.032        RIDING        0.055      0.055
TV             0.032      0.000        PLAYING       0.032      0.032
READING        0.029      0.029        HUNTING       0.032      0.000
WATCHING       0.029      0.000        SKATING       0.029      0.029
BIRD           0.028      0.000        JOGGING       0.028      0.000
HORSES         0.028      0.000        DRAWING       0.028      0.028
PLAYSTATION    0.028      0.000        SOCCER        0.028      0.000
DRAWING        0.026      0.026        BIRD          0.026      0.000
BASEBALL       0.023      0.000        READING       0.023      0.023
GARDENING      0.023      0.000        BOCHE         0.023      0.000
WRESTLING      0.023      0.000        PAINTING      0.023      0.000
FOOTBALL       0.018      0.018        GOATS         0.018      0.000
SKATING        0.018      0.018        BRIDGE        0.018      0.000
ARCHERY        0.014      0.000        RADIO         0.014      0.000
JACK'S         0.011      0.000        JILL          0.011      0.000
JACOB          0.011      0.000        CAROL         0.011      0.000
JOSHUA'S       0.009      0.000        GARY'S        0.009      0.000
TREVOR'S       0.007      0.000        TOM           0.007      0.000
OFTEN          0.000      0.000        WHILE         0.000      0.000
OTHER          0.000      0.000        NEVER         0.000      0.000

Score 3        3.401                   Score 3       3.539
Score 4        2.885                   Score 4       3.028
Score 5        0.8482                  Score 5       0.8557

Score 1: Profile Word Weight
Score 2: Points for matched words
Score 3: Sum of word weights
Score 4: Sum of matching word points
Score 5: Score 4/Score 3

Data Profile Formula 8 (Score 1):

Word Profile Weight Score=carried forward from Data Profile Score 7

Data Profile Formula 9 (Score 2):

Match Points=Where both Text Block Groups contain the word then use Score 1, otherwise zero

Data Profile Formula 10 (Score 3):

Word Weight Total=Sum (Score 1)

Data Profile Formula 11 (Score 4):

Match Points Total=Sum (Score 2)

Data Profile Formula 12 (Score 5):

Data Profile Comparison Score=Score4/Score3

Applying the formulas to the example yields the results:

“Hobbies” score 5 = 0.848
“Other Interests” score 5 = 0.856

indicating that the two data profiles share many common characteristics.
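The comparison formulas reduce to a ratio of matched weight to total weight, computed separately from each profile's point of view. The sketch below uses hypothetical names and a three-word excerpt of each profile for brevity; the final two lines check the ratio against the totals reported for the full profiles in Table 9.

```python
def comparison_score(profile, other_profile):
    """Data Profile Comparison Score: matched word points / total word weight."""
    total = sum(profile.values())                 # Word Weight Total (Score 3)
    matched = sum(weight for word, weight in profile.items()
                  if word in other_profile)       # Match Points Total (Score 4)
    return matched / total if total else 0.0


# Three-word excerpts of the two profiles (profile word weights from Table 9).
hobbies = {"HOBBIES": 0.300, "LEISURE": 0.203, "THING": 0.156}
other_interests = {"HOBBIES": 0.295, "LEISURE": 0.221, "WORKING": 0.177}

excerpt_hobbies = comparison_score(hobbies, other_interests)
excerpt_other = comparison_score(other_interests, hobbies)

# Ratio of the full-profile totals reported in Table 9 (Score 4 / Score 3).
full_hobbies = round(2.885 / 3.401, 3)
full_other = round(3.028 / 3.539, 3)
```

The score is directional: each profile is scored against its own total weight, which is why the two profiles report slightly different values for the same pair.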

Examples of other more advanced formulas include:

-   Ignoring lower point words so as not to subtract points for words that lack distinguishing characteristics and not to award points to a highly ranked word in one text block group that matches a low ranked word in the other text block group.
-   Computing statistical metrics for sub-profiles based on sub-text blocks, such as the caption heading sub-text block within the text block encapsulating an entire paragraph.

(v) Merge Text Block Groups

Profile-to-profile matching, performed in step B.3(b), determines that the data profiles for “Hobbies” and “Other Interests” are similar, and the redundancy check performed in step B.3(c) indicates they are duplicate variants. Merge rules are defined whereby data profiles meeting a configurable rule set are merged into one text block group. In the example, the “Hobbies” branch will have 340 members. The profiles are recalculated to reflect this new combined text block group.

Since this example is a data domain with captions, captions are stored as sub text blocks and sub text block groups.

TBG: “Hobbies” (340)
Profile:
  Sub TBG: captions
   Sub TBG with caption = “Hobbies” (190)
   Profile
   Sub TBG with caption = “Other Interests” (150)
   Profile
  Other Profile information (e.g. word and file statistics).

This allows, for example, the use of formulas that understand the sub text block group profiles, such as using the various captions associated (via sub text block groups) with a text block group to match another caption. As a result, the engine stores “Other Interests” as an alternative caption to “Hobbies” and this information can be used in subsequent passes through the engine to match captions.

Output Step B.3(d)

Objective (950 Text Blocks)
Experience (920 Text Blocks)
Education (870 Text Blocks)
Clubs and Affiliations (210 Text Blocks)
Hobbies (340 Text Blocks)
Contact (850 Text Blocks)
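A merge rule of this kind might combine the profile comparison scores with the redundancy score, then fold the less frequent group into the more frequent one while keeping its caption as an alternative. The following sketch is an assumption-laden illustration: the function names, the dictionary layout and the two thresholds are hypothetical, while the input figures come from Tables 6, 7 and 9.

```python
def should_merge(similarity_a, similarity_b, redundant_score,
                 min_similarity=0.8, max_redundancy=0.05):
    """Merge when both profiles are similar and the groups rarely co-occur."""
    return (min(similarity_a, similarity_b) > min_similarity
            and redundant_score < max_redundancy)


def merge_groups(groups, keep, drop):
    """Fold group `drop` into group `keep`, retaining its caption as an alternative."""
    merged = {name: dict(info) for name, info in groups.items() if name != drop}
    merged[keep] = {
        "count": groups[keep]["count"] + groups[drop]["count"],
        "captions": groups[keep]["captions"] + groups[drop]["captions"],
    }
    return merged


groups = {
    "Hobbies": {"count": 190, "captions": ["Hobbies"]},
    "Other Interests": {"count": 150, "captions": ["Other Interests"]},
    "Education": {"count": 870, "captions": ["Education"]},
}

# Comparison scores 0.848 and 0.856 (Table 9), redundancy score 0.013 (Table 7).
if should_merge(0.848, 0.856, 0.013):
    groups = merge_groups(groups, keep="Hobbies", drop="Other Interests")
```

After the merge the “Hobbies” group holds 340 text blocks and lists “Other Interests” as an alternative caption, mirroring the sub text block group structure described above.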

(c) Text Block-to-Text Block Group Comparison

The process of creating initial text block groups is tuned for high precision. As a result, many text blocks are likely not matched to a text block group. With the additional information in the data profiles, these unmatched text blocks can be compared to the data profiles.

(i) Base Scores

Each unmatched text block is compared to all data profiles. In thisexample, unmatched text block 5 (from file 3) is compared to the dataprofile of the hobbies text block group.

Sample Text Block 5 (from File x)
LEISURE: John's hobbies include many activities such as wake boarding, painting and landscaping. He enjoys playing bocce ball any time he can.

TABLE 10 Text Block-to-Text Block Group Matching

Unmatched Text Block 5
Word           Score 1    Score 2
LEISURE        0.210      1
John's         0.030      0.5
hobbies        0.100      1
include        0.340      1
many           0.870      0.1
activities     0.500      1
such           0.870      0.1
as             0.870      0.1
wake           0.040      0.5
boarding       0.040      0.5
painting       0.040      0.5
and            0.960      0.1
landscaping    0.040      0.5
he             0.780      0.1
enjoys         0.220      1
playing        0.070      0.5
bocce          0.040      0.5
ball           0.030      0.5
any            0.870      0.1
time           0.230      1
he             0.780      0.1
can            0.870      0.1
Score 3                   10.8

Hobbies Text Block Group
Word           Score 1    Score 2
ACTIVITIES     0.367      0.367
HOBBIES        0.300      0.300
Interests      0.275      0.275
LEISURE        0.203      0.203
time           0.177      0.177
favorite       0.177      0.177
enjoys         0.161      0.161
thing          0.156      0.000
spending       0.154      0.154
relaxes        0.138      0.138
include        0.125      0.125
second         0.121      0.121
free           0.099      0.099
exploring      0.095      0.000
family         0.088      0.088
riding         0.088      0.088
distant        0.081      0.081
book           0.081      0.081
front          0.073      0.073
allows         0.055      0.055
playing        0.032      0.032
TV             0.032      0.000
reading        0.029      0.029
watching       0.029      0.000
bird           0.028      0.000
horses         0.028      0.000
PlayStation    0.028      0.000
drawing        0.026      0.026
Baseball       0.023      0.000
gardening      0.023      0.000
Wrestling      0.023      0.000
football       0.018      0.018
skating        0.018      0.018
Archery        0.014      0.000
Jack's         0.011      0.000
Jacob          0.011      0.000
Joshua's       0.009      0.000
Trevor's       0.007      0.000
Often          0.000      0.000
OTHER          0.000      0.000
Score 3        3.401
Score 4        2.885
Score 5        0.848

Score 1: Profile Word Weight
Score 2: Matched Word Points
Score 3: Sum Profile Word Weight (Score 1)
Score 4: Sum Matched Word Points (Score 2)
Score 5: Score 4/Score 3

Applying the text block comparison methods described earlier, the system calculates the selected data metrics. In this example, high word weight scores and leading word scores are computed; however, additional comparison formulas may be applied depending on the type of document set analyzed.

High Word Weight Comparative Scores:

Text Block Match Formula 1:

High Weight Word Match Score=sum (high frequency word weight points)/Word Weight (Hobbies Text Block Score 2)

Hobbies Score = 1.079/2.885 = 0.370
Text Block 5 Score = 4.50/10.800 = 0.420

Leading Words Comparative Scores (First 4 Words)

Text Block Match Formula 2:

Leading Word Match Score=sum (leading word weight points)/Word Weight (Hobbies Text Block Score 2)

Hobbies Score = 0.870/1.445 = 0.760
Text Block 5 Score = 2/3.5 = 0.57

Additional match scores can be calculated and match rules applied through configuration settings (see Exhibit 3). Where the unmatched text block is determined to meet the rules, it is added to the text block group, augmenting the statistical information in the profile.

(ii) Match and Resolution Scores

The data profile engine will often generate multiple strong matches for text blocks and data profiles. Formulas and rules are applied to determine the best matches. In the circumstance of identifying a unique set of text block groups, as opposed to document matching, the formulas and rules are tuned for high precision. Where the task is matching a new document to the data profiles, the match rules are relaxed and tuned for higher recall.

In general, the best match rules draw on information in the data profiles and the match results generated in the process of creating text blocks and data profiles. The example shown illustrates a simple application of the approach.

Sample Best Match Rule 1:

-   If ((Match Score X of Text Block Group 1 is within 0.1 of Match Score X of Text Block Group 2) and (Match Score Y of Text Block Group 1 is greater than Match Score Y of Text Block Group 2))
    -   Then Text Block Group 1 is the best match;
-   Otherwise If (Match Score X of Text Block Group 1 is greater than Match Score X of Text Block Group 2)
    -   Then Text Block Group 1 is the best match
    -   Otherwise Text Block Group 2 is the best match
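Sample Best Match Rule 1 can be read as a tie-break: when the two groups are close on Match Score X, Match Score Y decides; otherwise the higher X wins. The sketch below is a direct transcription under that reading, with hypothetical names and made-up scores, and it treats “within 0.1” as an absolute difference of at most 0.1.

```python
def best_match_rule_1(group_1, group_2, tolerance=0.1):
    """Return 1 or 2, the text block group chosen by Sample Best Match Rule 1.

    Each group is an (x, y) pair of match scores (assumed representation).
    """
    x1, y1 = group_1
    x2, y2 = group_2
    if abs(x1 - x2) <= tolerance and y1 > y2:
        return 1  # close on X, so the better Y score decides
    if x1 > x2:
        return 1  # otherwise the higher X score wins
    return 2


# Hypothetical match scores for two candidate groups.
close_on_x = best_match_rule_1((0.80, 0.90), (0.85, 0.70))
far_apart = best_match_rule_1((0.60, 0.90), (0.85, 0.70))
```

In the first call group 1 wins despite a slightly lower X score, because the X scores are within the tolerance and its Y score is higher; in the second call the X scores differ by 0.25, so group 2 wins on X alone.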

Typically, resolution of best matches requires the application of several formulas and scores. For example, if the match score of text block 1 and the match score of text block 2 are both strong, then the match attributes of other text blocks are analyzed to evaluate the degree of match and what they matched.

Sample Best Match Rule 2:

-   If Text Block Group 1 has already been matched, then Text Block Group 2 is considered the Best Match.

Sample Best Match Rule 3:

-   If Text Block is located contiguously with an unmatched Text Block, then choose the match that is most proximate to that other Text Block's matched Text Block Group.

It is important to note that the system retains knowledge (in the data profile) of why matches were assigned and what other potential matches exist. This allows for later use of these statistics in future decisions.

(d) Text Block Group Sets

A text block group set is a collection of text block groups as defined by configurable formulas. An example is the set of all text block groups in a document domain. Another example is the set of all text block groups that share certain characteristics across document domains.

(i) Sequencing

Text block group sets generated by the engine can reflect the order most commonly appearing in the input set of documents. In the example, the resume structure follows the order: objective, experience, education, clubs and affiliations, hobbies and contacts. Individual documents may choose a different sequence. The goal of the sequencing formulas is to determine the appropriate order of the data profiles by means of statistical averages, adjusted where necessary to handle exceptional cases.

To illustrate the approach, formulas are applied first to identify the relative average location of each text block in its source file.

Sequencing Formula 1:

Relative File Sequence=File Sequence Order/Total File Text Blocks in the File Text Block Sequence; and

Average=Sum (Relative File Sequence)/Total Files in Text Block Group

Text Block Group    Order in File    Total Text Blocks in File    Relative Sequence
Hobbies             5                6                            0.833
Hobbies             4                4                            1.000
Hobbies             4                5                            0.800
Average Relative Sequence                                         0.878

The Text Block Group Average Relative Sequence is therefore 0.878. Applying the relative sequence of all text block groups, individual text blocks and all text block groups can be ordered by this sequence value. In some cases, additional formulas are applied to capture sections that typically appear in a particular order, such as the contacts section of a resume, which is usually the last section of the file. Average relative sequence is supplemented with mean and modal sequence statistics to gather information about absolute sequencing, such as text blocks that typically appear at the start or the end of a file, or typically appear following a particular text block.
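The sequencing calculation for the “Hobbies” group can be reproduced in a few lines. The sketch below uses hypothetical function names; the three (order, total) pairs come from the table above, while the averages used to demonstrate ordering for the other groups are invented for illustration.

```python
def average_relative_sequence(positions):
    """Average of order/total over the files that contain the text block group.

    positions: list of (order in file, total text blocks in file) pairs.
    """
    relative = [order / total for order, total in positions]
    return sum(relative) / len(relative)


def order_groups(average_by_group):
    """Order text block groups by ascending average relative sequence."""
    return sorted(average_by_group, key=average_by_group.get)


# "Hobbies" positions from the sequencing table: 5 of 6, 4 of 4, 4 of 5.
hobbies_average = average_relative_sequence([(5, 6), (4, 4), (4, 5)])

# Hypothetical averages for two other groups, to show the ordering step.
sequence = order_groups({
    "Hobbies": hobbies_average,
    "Objective": 0.15,
    "Education": 0.55,
})
```

Dividing by the number of text blocks in each file normalizes for document length, so a section that is fourth of four and one that is fifth of six are both recognized as appearing near the end.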

(ii) Hierarchy

The data profile also gathers information about the organizational structure and relationships between text blocks to determine the depth level of each text block and its parent-sibling-child relationships. The hierarchical associations provide the engine with more data points with which to make decisions, such as the familial relationships and sub sequencing. For example, where the engine is analyzing sample text block 15, the structural elements shown in FIG. 6 may be identified.

Text Block 15

LEISURE TIME ACTIVITIES
  SPORTS AND RECREATION: Integer mattis eros ut arcu. Maecenas sagittas, justo a pulvinar malesuada, enim eros blandit ipsum, eget ultices velit ipsum eu erat.
  HOBBIES: Lorem ipsum dolor sit amet, consecutetuer adipiscing elit. Proin vulputate, nibh sit amet tincidunt pellentesque, lorem punts mollis orci, eu mollis enim quam erat. Morbi facilis fermentum felis.

In the first pass (based on an untrained mode of operation), the engine relies on formatting and lexical rules to identify the relationships, in the same manner undertaken in identifying text blocks in the first iteration through the input set of documents. After data profiles have been constructed, the relationship information in the profiles can be applied to identify the text blocks and the relationships between the text blocks. The engine captures and stores the profile information for hierarchically nested text blocks as sub text blocks, applying the same methodology used to capture and store captions, as discussed in section 2.1.2(1)(b).

(iii) Common Sub-Groups Rules

Text block relationship rules analyze familial relationships of a particular text block. In one approach, the engine matches text block 15 “Leisure Time Activities” to the text block group 10 “Other Interests” because they possess matching children. In practice, numerous profile matching techniques applying the matching approaches described are required to determine if the children of a text block match the children of a text block group.

Text Block Group 10 (Data Profile Hierarchy)

OTHER INTERESTS
    SPORTS AND RECREATION
    HOBBIES

Sample sub Text Block Group Matching Formula 1:

-   -   If (more than a defined percentage of Sub Text Block Groups match other Sub Text Block Groups)
    -   Then the Text Block is a match for the Text Block Group.
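A minimal sketch of this sample formula follows. It assumes that children are compared by normalized caption text only, whereas the specification contemplates richer profile matching; the function name and the default threshold are hypothetical.

```python
def matches_group(block_children, group_children, threshold=0.5):
    """Return True when more than `threshold` of the group's sub text
    block groups are matched by children of the text block.

    Assumption: two children match when their captions are equal after
    trimming whitespace and ignoring case.
    """
    if not group_children:
        return False
    normalized = {caption.strip().lower() for caption in block_children}
    matched = sum(
        1 for caption in group_children if caption.strip().lower() in normalized
    )
    return matched / len(group_children) > threshold

# Text block 15 against text block group 10 from the example above.
is_match = matches_group(
    ["SPORTS AND RECREATION", "HOBBIES"],
    ["Sports and Recreation", "Hobbies"],
)
```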

In addition, formulas are applied to identify word clusters within a text block group, that is, common groups of words within the text of each member of the text block group. These clusters are then used in the same manner as the children in the example above. These clusters can also be used to highlight the important text in a text block for an end user and, conversely, to identify proper nouns, specific document items, etc.

(iv) Relationship Formulas and Scores

Additional intelligence rules analyze familial relationships across the entire text block group set and determine the best location, level and sequence of each data profile. Familial relationships across the entire text block group set apply formulas to cross-check profiles to look for additional commonalities, deficiencies, and other information.

Identify Text Block Groups and Sub Text Block Groups:

The data profile characteristics of parents, siblings and children are applied in rules and formulas to identify text block groups with similar characteristics. For example, two text block groups sharing the same parent and common children can be detected by a formula.

Identify Alternative Locations:

The same or similar sections (text blocks) may appear in different locations or in a different sequence from one document to another. For example, text block group 1 has a depth level of 2, while text block group 2 has a depth level of 4 and is not a descendant of text block group 1. A formula can detect that these two have very similar profiles and that any individual document will rarely, if ever, possess both text block groups. Thus, a standard representation of the document structure should not include both.
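One way such a formula might be sketched is shown below. It treats two groups as alternative locations when their word profiles are similar and they rarely occur in the same document. The Jaccard similarity measure, the thresholds and all names are assumptions made for illustration, not the formula used by the engine.

```python
def are_alternative_locations(words_a, words_b, docs_a, docs_b,
                              min_similarity=0.6, max_cooccurrence=0.05):
    """Decide whether two text block groups look like the same section
    placed at different locations.

    words_a, words_b: sets of profile words for each group (assumed).
    docs_a, docs_b: sets of document ids in which each group occurs.
    """
    all_words = words_a | words_b
    all_docs = docs_a | docs_b
    if not all_words or not all_docs:
        return False
    # Jaccard similarity stands in for the profile matching rules.
    similarity = len(words_a & words_b) / len(all_words)
    # Share of documents that contain both groups.
    cooccurrence = len(docs_a & docs_b) / len(all_docs)
    return similarity >= min_similarity and cooccurrence <= max_cooccurrence

group_1 = {"governing", "law", "jurisdiction", "state", "courts"}
group_2 = {"governing", "law", "jurisdiction", "state", "venue"}
alternative = are_alternative_locations(group_1, group_2, {1, 2, 3}, {4, 5})
```

When the check returns True, a template builder would keep only one of the two groups in the standard outline.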

(e) Text Block-to-Text Block Group Set Comparisons

In general, this comparison applies data profile matching rules to compare text blocks that are not a part of any text block group, or text blocks in new documents (documents that are not a part of the text block set data profile), to the data profiles in the selected set of text block groups, matching either individual text blocks or the entire set of text blocks in the new document. The rules are based on base scores, match scores, and resolution scores with adjusted thresholds to account for the fact that the process involves a one-to-many comparison, as opposed to the many-to-many comparisons used for creating text block groups.

The process of matching operates in a configurable sequence, and in each pass lists of matched and unmatched text blocks are maintained. First, level 1 text blocks and their direct descendants in the new document are compared to level 1 data profiles in the text block group set and their direct descendants (i.e. comparing among levels and between adjacent levels). Second, each successive lower level of text blocks and their direct descendants in the new document are compared to lower level data profiles in the text block group set and their direct descendants. Third, each successive text block level in the new document is compared to each successive level in the text block group set, regardless of ancestry. Finally, any remaining unmatched text blocks in the new document are allowed to match any data profile in the text block group set.

(3) Bootstrapping—Using Data Profiles to Create and Enhance Data Profiles

This section begins with an example of creating data profiles in “untrained” mode. Once data profiles are initially created, they can be used to enhance the next creation of data profiles and allow the engine to operate in a “trained mode.” Examples of this bootstrapping follow, and additional examples of bootstrapping are described in the tool creation engine section, 2.2.1.

(a) Bootstrapping Example—Document Decomposition

Bootstrapping may be applied, for example, to the process of document decomposition. In an untrained mode, decomposition relies on lexical characteristics, such as paragraph break rules and identification of headings based on text formatting. With existing data profiles, documents can be decomposed more accurately using the text block group profiles.

(b) Bootstrapping Example—Data Profile Creation

In a trained mode of operation, together with more accurate document decomposition, the data profile engine can re-process all documents and generate more accurate data profiles. This process can be repeated as specified by settings or by formulas that instruct the data profile engine to stop re-processing once the resulting changes are insignificant.
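The repeat-until-insignificant loop can be sketched as follows. The callables `build_profiles` and `change`, the pass limit and the tolerance are all hypothetical placeholders supplied by the caller; the sketch only shows the control flow, not how profiles are actually built or compared.

```python
def bootstrap(documents, build_profiles, change, max_passes=10, tolerance=0.01):
    """Re-process documents until profiles stop changing significantly.

    build_profiles(documents, prior) returns new profiles; prior is None
    for the first, untrained pass (assumed interface).
    change(old, new) returns a non-negative number measuring how much
    the profiles changed between passes (assumed interface).
    """
    profiles = build_profiles(documents, None)  # untrained pass
    passes = 1
    while passes < max_passes:
        new_profiles = build_profiles(documents, profiles)  # trained pass
        passes += 1
        delta = change(profiles, new_profiles)
        profiles = new_profiles
        if delta < tolerance:
            break  # resulting changes are insignificant
    return profiles, passes
```

The `max_passes` limit corresponds to stopping "as specified by settings", and the `tolerance` check corresponds to stopping by formula.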

(c) Bootstrapping Example—Text Block Group-to-Text Block Group Comparison

In an untrained mode of operation, the process of creating text block groups is based on scorings that include potentially redundant text block groups. As a result, the relative scores matching other text block groups are less accurate or non-existent. And because the engine lacks accurate data regarding the importance of the words in the profiles, the engine utilizes different threshold settings for an untrained mode of operation.

Text block group word profiles build statistical information determining how distinguishing each word is across all text block groups. In an untrained mode, however, the engine cannot determine how distinguishing a word is compared to all text block groups, because the engine does not know which text block groups are redundant, and therefore scores will be less accurate.

The processing flow for creating text block groups is:

-   -   1) Create text block groups,
    -   2) Perform redundancy check:
        -   a) apply file commonality rules,
        -   b) double check those with text block group vs. text block group comparison of word profiles,
    -   3) Create text block group word profiles calculating commonality across text block groups,
    -   4) Match text blocks to text block group,
    -   5) Repeat steps 1 through 4 as many times as specified by configuration settings.

In an untrained mode, the engine may lack sufficient information to accurately perform step 2(b). In subsequent passes, or in a trained mode of operation, the engine can benefit from the distinguishing words calculation and scoring rules adjusted to reflect the greater precision. In this manner, using a trained mode of operation, the data profiles build increasingly accurate information and more varied word examples, and thereby learn from the data.

2.2. Tool Creation Engines and Product Interfaces

2.2.1 Tool Creation Engine

Tool creation engines apply formulas to the data profiles to enhance the data profiles and to create tools for end users such as search capability and document templates. The tool creation engines' formulas create tool-specific scores which can be stored back to the data profiles for use by other tool creation engines and to enhance the intelligence of the data profile engine for future processing. An example is a tool creation engine with sophisticated formulas to calculate relationships across a text block group set. The results of these formulas are not only useful to create a tool to view a standard outline of a document set, but the results are also stored to the data profiles to add intelligence to the document classification tool.

Different data domains will share appropriate tool creation engines but may also require specialized tool creation engines.

Tool creation engines are constructed from the data profile engine and facilitate the development of a wide range of software systems for organization, research, reuse and validation of individual files or sets of files. For example, the systems can be built for analyzing a set of files and creating document templates, searching and classifying files and any part of a file, identifying exemplar or alternative sections, and validating entire files or any section of a file against the standard established by a template.

2.2.2 Product Interfaces

Product interfaces combine the tools into logical product packages for a specific user domain and apply human readable interfaces to the tools created by the tool creation engines.

2.2.3 Tool Creation Engine and Product Interface Examples

Using the example of resume analysis discussed in the data profile engine section, a standard template showing the frequency of occurrence of each section of the template is shown in FIG. 9, and this example is used to illustrate other Tool Creation Engines and Product Interfaces.

(1) Tools for Document Organization

(a) Master Document Template Product Interface

(i) Description of Master Document Templates

A document template is a standardized, exemplary document structure, organizing all matching text blocks in a source set of documents and creating a single organizing framework. From an end-user perspective, it is a master specification or a guiding framework capturing and organizing all distinct elements from all source documents. It can be viewed as an outline (See FIG. 2A) similar to the table of contents of a book, typically listed at the front of a publication, or as an index of topical items sometimes appearing at the end of a document (See FIG. 2B).

The document template tool creation engine automates the process of defining the outline structure, identifying text examples associated with each outline element, and maintaining the structure as new standards emerge. The engine can analyze a few files or many thousands of documents. The greater the number of source documents, the more variants the engine will identify.

The document template product interface provides an end-user interface on the results of the document template tool creation engine. FIGS. 7 and 8 show examples of a legal document template generated by the engine. In addition to displaying the structure and organization of the standardized document, information is displayed to show the frequency with which each section of the outline occurs in the template. Sections occurring with high frequency may be denoted and considered as standard or even required elements, while those with a low frequency may be denoted and viewed as optional, document specific or emerging new language.

(ii) Types of Master Documents

The engine can produce different types of outlines, depending on the selection of the template generation rules.

(iii) Master Document Template

A master template aggregates all document type templates (such as model, checklist and universal templates) used in a particular professional practice area into a single comprehensive outline, such as a master legal document template capturing all the discrete, matching document elements for every type of legal agreement, or a master building template containing all the design and construction requirements for every type of building project.

(iv) Model Document Template

A model template generates a standardized template for one particular type of document, such as a residential home construction specification or a credit agreement. The model template serves as a best practice guideline containing standard sections (i.e. those over a stated commonality threshold) and applies mutually exclusive rules to remove redundant variants so that each topical section occurs in only one place in the Template 7.

(v) Checklist Document Template

A checklist template contains all matching clause variants and redundant variants. It is typically constructed from a large sample set of documents so that the template framework contains all topical elements, together with detailed sub-elements for each section, thereby creating an outline of all items that may be considered in a particular project.

(vi) Universal Document Template

A universal template is an encyclopedia of all discrete topical variants, including infrequently occurring and unmatched sections, and may be used as a reference source.

(vii) Derivative Document Template

A derivative template is a template generated from a smaller set of documents based on the template and data profiles created from a larger document collection. Derivative templates can be used where, for example, a particular organization wishes to build a model template, but lacks sufficient examples to provide a data rich set of data profiles. In this circumstance, statistical information in the primary template is used to build a derivative whose structure is based solely on the smaller set of documents.

(viii) Document Decomposition

A new document can be decomposed into an outline structure by identifying text blocks in a new document and matching the text blocks to the selected data profiles, providing a navigation tool to link to sections in the document, whether or not the new document contains captions or numbered sections, as shown in FIG. 10.

(b) Master Document Template Tool Creation Engine

Using the data profiles, the document template tool creation engine automates the process of defining the outline structure, identifying text examples associated with each outline element and maintaining the structure as new standards emerge. The engine can analyze a few files or many thousands of documents. The greater the number of source documents, the more variants the engine will identify.

Each of the functions requires a tool creation engine to analyze and create the data necessary for the product interface.

(2) Tools for Document Research

(a) Text Search

A Search “Tool Creation Engine” is a set of formulas that use the data profiles to match text to text block groups in order to create a search tool that can be used in an end-user product interface.

For example, if the desired search tool is a text box in which the user can enter a textual query and receive a list of the top text block group matches, it may apply the following sets of formulas:

-   -   The end user entered text would be treated as a Text Block and can be scored by formulas such as those previously discussed
    -   The Text Block could now be compared to all of the Text Block Group Data Profiles as discussed earlier or with new rules and formulas

The resulting match list may be sorted by formulas similar to those previously discussed. More advanced search tools can be created by utilizing more of the information in the data profiles such as specifying which text block groups should be used.
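The query flow above can be sketched in a few lines. The sketch assumes each text block group data profile is reduced to a mapping of word to weight and that a match score is a weighted sum over the query words; the profile contents, the tokenizer and the function names are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase words (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())

def search(query, group_profiles, top_n=3):
    """Treat the query as a text block and rank text block groups.

    group_profiles maps a group name to a {word: weight} profile.
    Returns up to top_n (group name, score) pairs, best match first.
    """
    query_words = Counter(tokenize(query))
    scores = []
    for name, word_weights in group_profiles.items():
        score = sum(word_weights.get(word, 0.0) * count
                    for word, count in query_words.items())
        if score > 0:
            scores.append((name, score))
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:top_n]

# Hypothetical profiles, for illustration only.
profiles = {
    "Hobbies": {"leisure": 0.9, "hobbies": 1.0, "sports": 0.7},
    "Education": {"university": 1.0, "degree": 0.9},
}
results = search("sports and leisure", profiles)
```

Restricting `group_profiles` to a chosen subset is one way to model a search that specifies which text block groups should be used.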

(b) Document and Section Search

A document search “tool creation engine” can find entire documents by analyzing the structure and contents of an example document compared to text block group sets and their data profiles. Topical sections of documents can be located based on a selected section or fragment of text by comparison to the data profile in user selected or automatically identified templates. In addition, the engine facilitates word searching for captions and text content.

(c) Document and Section Classification

A new document in its entirety is classified by comparing its structure and contents to document type text block group sets by applying data profile matching rules. Topical sections of a new document are classified by comparison to program-selected (or user selected) text block group sets and their associated data profiles.

For example, a tool creation engine can be created to analyze data profiles, word commonality and divergence between various text block group sets and store findings to the data profiles. The data profile engine may thereafter use this information to classify documents to specific text block group sets. For example, the existence and frequency of a few words and the lack of a few other words may indicate that a document is a resume.

(3) Tools for Document Drafting and Reuse

(a) Identification of Alternative Clauses

The document and section search tools are applied to find alternative examples of particular sections without the need for explicit search techniques. Users can select a particular section, and the words in the section are used as a search pattern to identify and find alternative examples. The search result set can be ordered in sequence of conformity to the data profile or grouped by clustering the result set of sections based on the attributes of the alternative sections.

(b) Identification of Default Clauses

In addition, the tool creation engine can identify an exemplar section as the alternative most closely matching the data profile.

(c) Document Outline Display

The tool creation engine may also be used to display the outline of a document, such as a resume, and provide other functionality. For example, formulas may be constructed to utilize the data profile information to produce the following:

-   -   Determine the Subject of the Resume: by defining formulas to analyze word frequency, location and order of words in each text block group of words compared to the data profile and the word universe, the tool creation engine can identify the subject of the resume as “Jacob Teggill's” resume.
    -   Display an Outline of the Document: The outline of the document is built from the matching text block groups.
    -   Display a Caption for each Text Block Group: The caption for each text block is drawn from sub-text block groups.
    -   Provide Hypertext-like Navigation Links: Each caption heading can be defined as a link to the specified section in the document using the file offset stored in the text blocks.

Jacob Teggill's Resume
1. Objective
2. How I Could Help ACME, Inc.
3. Experience
4. Contact Information
5. Hobbies
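The outline and navigation functions can be sketched as follows, assuming each matched text block is available as a caption plus a character offset into the file. The dictionary layout, the function names and the sample document are hypothetical.

```python
def build_outline(text_blocks):
    """Return numbered (number, caption, offset) entries in file order."""
    ordered = sorted(text_blocks, key=lambda block: block["offset"])
    return [(number, block["caption"], block["offset"])
            for number, block in enumerate(ordered, start=1)]

def jump_to(document, outline, number):
    """Return the document text from the chosen entry up to the next entry."""
    offsets = [offset for _, _, offset in outline]
    start = offsets[number - 1]
    end = offsets[number] if number < len(offsets) else len(document)
    return document[start:end]

# Hypothetical document and matched text blocks, for illustration only.
document = "OBJECTIVE\nSeeking a role.\nEXPERIENCE\nTen years.\nHOBBIES\nArchery.\n"
blocks = [
    {"caption": "Experience", "offset": document.index("EXPERIENCE")},
    {"caption": "Objective", "offset": document.index("OBJECTIVE")},
    {"caption": "Hobbies", "offset": document.index("HOBBIES")},
]
outline = build_outline(blocks)
```

Because navigation relies only on stored offsets, it works whether or not the source file contains numbered sections.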

(4) Tools for Document Validation and Analysis

(a) Document Benchmarking

Entire documents and text sections are benchmarked to a selected text block group set by comparison to the data profiles in the text block group sets. Benchmarking identifies matching text blocks, displays the frequency that each text block occurs in the text block group sets, the consistency of the language in each text block in a document compared to the data profiles in the text block group sets, and any text blocks occurring with high frequency in the text block group set that are missing from the benchmarked document.

Jacob Teggill's Resume
1. Objective (99%)
2. How I Could Help ACME, Inc.
3. Experience (96%)
4. [Education] (98%)
5. Contact Information (97%)
6. Hobbies (42%)
7. [Clubs and Affiliations] (38%)

In the resume example, the tool creation engine analyzes the data profiles so that the product interface can provide feedback to the user, identifying:

-   -   The section entitled “How I Could Help ACME, Inc.” is not a common section in a Resume.
    -   The section titled “Contact Information” is typically named “Contacts.”
    -   The Hobbies section typically comes before CONTACTS
    -   A section titled “Education” is missing and is very frequent in most resumes
    -   A section titled “Clubs and Affiliations” is not present, but may optionally be added.
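Part of this feedback can be reproduced with a short sketch that compares the sections found in a document with the frequency of each text block group. The frequencies are those shown in the benchmarked outline above; the 0.9 cut-off for a "standard" section and the function name are assumptions.

```python
def benchmark(document_sections, group_frequency, standard=0.9):
    """Compare a document's sections with text block group frequencies.

    Returns sections unknown to the group set, frequent groups missing
    from the document, and infrequent groups missing from the document.
    """
    present = set(document_sections)
    return {
        "uncommon": [section for section in document_sections
                     if section not in group_frequency],
        "missing_standard": [group for group, freq in group_frequency.items()
                             if group not in present and freq >= standard],
        "missing_optional": [group for group, freq in group_frequency.items()
                             if group not in present and freq < standard],
    }

frequencies = {
    "Objective": 0.99, "Experience": 0.96, "Education": 0.98,
    "Contact Information": 0.97, "Hobbies": 0.42,
    "Clubs and Affiliations": 0.38,
}
resume = ["Objective", "How I Could Help ACME, Inc.", "Experience",
          "Contact Information", "Hobbies"]
report = benchmark(resume, frequencies)
```

Caption and sequence feedback, such as the preferred section name or the position of the Hobbies section, would need the caption and sequencing statistics held in the data profiles and is not covered here.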

(b) Document Standardization

In addition to providing a benchmarking report, the tool creation engine can propose changes to be made to a specific document. Applying sequencing formulas and sub-text block caption formulas, the system can automatically re-order the sequence of the “Hobbies” section and rename the “Contact Information” section to “Contacts.”

(c) Language Conformity

In addition to analyzing the occurrence and frequency of text blocks compared to the data profiles, the tool creation engine can analyze the degree of consistency of the language used in each section of the new document compared to the data profiles. The analysis applies matching formulas to validate the presence of sub-text blocks, common words, word pairs and word clusters. In the same manner as document benchmarking, the validation feedback can identify conforming language, non-conforming language, and missing language.

Jacob Teggill's Resume
1. Objective (38%)
2. How I Could Help ACME, Inc.
3. Experience (74%)
4. Contact Information (65%)
5. Hobbies (51%)

In the resume example, the section entitled “Objective,” while matching the location and sequence of the Text Block in the Data Profiles, shows little conformity of the language used compared to the Data Profile and is therefore highlighted.

(d) Key Language Identification and Mark-Up

Typically, text processing techniques for meta-data extraction operate using rule-based techniques to capture titles, headings, names, places and dates. The data profiles facilitate a tool creation engine that uses the statistical information in the word universe and the data profiles to identify words and word clusters appearing in a document with much higher frequency than in the word universe. Words appearing frequently in a particular file, but having a lower commonality score in the word universe, are typically document specific data, such as names of people, companies, amounts and dates and other information specific to the document being analyzed.

By constructing data profiles, the engine provides a new method of data mining and identifying document specific information. By comparing the statistical information in the word universe to the data profiles, formulas identify words and word clusters appearing with higher frequency in a document compared to the universe. For example, a company name may appear frequently in a particular document, but is rare in the universe.
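A minimal sketch of such a formula is given below. It flags words whose relative frequency in the document is several times their frequency in the word universe, or which do not occur in the universe at all. The ratio, the minimum count, the sample universe frequencies and the function name are assumptions made for illustration.

```python
import re
from collections import Counter

def key_terms(document_text, universe_frequency, ratio=5.0, min_count=2):
    """Return words that are unusually frequent in the document.

    universe_frequency maps a word to its relative frequency in the
    word universe (assumed representation).
    """
    words = re.findall(r"[a-z']+", document_text.lower())
    counts = Counter(words)
    total = len(words)
    found = []
    for word, count in counts.items():
        if count < min_count:
            continue  # ignore words that appear only in passing
        document_frequency = count / total
        universe = universe_frequency.get(word, 0.0)
        if universe == 0.0 or document_frequency / universe >= ratio:
            found.append(word)
    return sorted(found)

# Hypothetical universe statistics, for illustration only.
universe = {"the": 0.06, "shall": 0.05, "lender": 0.05}
text = ("ACME shall pay the fee. ACME shall notify the lender. "
        "The lender may assign.")
terms = key_terms(text, universe)
```

Here the company name is flagged because it recurs in the document but is absent from the universe, while common contract words are not.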

Key data identification formulas identify document specific language,proper nouns, dates, and amounts.

Text Block 1
HOBBIES: In his leisure time, Jacob Teggill enjoys the activities of Archery and Baseball and often relaxes by drawing or reading a book.

Text Block 2
HOBBIES: Jack's leisure activities include riding horses, skating, and spending time with his family.

In the example, bolded text represents common words and word clusters in the text block group profiles. Italicized words are more common in this text block group than in the word universe. Underlined words are very uncommon in the word universe, but they appear unusually frequently in a particular document and are likely key entities or subject matter topics in the document. Combining the key data identification rules with other known attributes of data profiles offers faster and more accurate data mining compared to existing rules-based approaches.

Use Cases (Examples from Law and Architecture)

A document research engine identifies the best matching model document for a new drafting project. For example, a lawyer may use the tool to identify a particular existing loan agreement closely matching a reference standard from the lawyer's personal collection of documents, a law firm's collection, or a public collection to be used as a starting point for a new drafting assignment. An architect may use the tool to retrieve a construction specification for a particular type of building project from the architect's archives.

A document deconstruction and navigation tool summarizes the provisions in a file, creates an outline and index of all sections in the file, and provides a method to hypertext link to each section. A lawyer reviewing a lengthy loan agreement, for example, may use the tool to create an outline of a file listing all its articles, clauses and sub-sections and use the index to find and display each section.

A document analysis engine benchmarks an existing document against a reference standard for commonality and consistency. For example, a lawyer may use the tool to analyze a document drafted by another lawyer or another law firm to determine how the new document matches a reference standard, to assess whether the clauses in the new document are frequently used standard clauses, whether they are uncommon, whether any standard clauses are missing, whether any clause is not located in the standard place in the document, and whether the language of each clause matches the reference standard, is missing any standard language or includes deal-specific language. An architect may use the tool to analyze a building specification to ensure it is complete, contains standard building specifications, is not missing any standard specifications, and that the language of each specification meets the building code guidelines.

A document drafting tool identifies default standard clauses, alternative clauses and infrequently used clauses. A lawyer, for example, reviewing a loan agreement may identify non-consistent clauses using the tool and replace the clause or selected text in the clause with a standard or default clause, or use the engine to display alternative clauses (grouped by conformity to the reference set of clauses or grouped by reference to the selected clause in the loan agreement), or use the engine to display infrequently used clauses. An architect reviewing a building specification for a new home construction may identify non-standard building specifications and use the tool to find and replace such non-standard specifications with the standard building specification and review alternative specifications.

A document analysis engine highlights standard terms in a document or section of a document, showing standard language and non-standard language. A lawyer, for example, may select a particular clause in a loan agreement and highlight and distinguish the standard language of the clause, the deal-specific language and key deal terms (such as party names, jurisdictions, dates, amounts, etc.).

A document validation engine ensures that a document satisfies the market standard. Upon completion of a drafting assignment, a lawyer or architect may use the tool to ensure that the document is complete, contains all required clauses or specifications, and that the language of each clause contains all required terms.

The methods according to the present invention are implemented using computer methods. In some implementations, various combinations of software and hardware may be used, as would be apparent to those of skill in the art and as desired by the user. In addition, the present invention may be implemented in conjunction with a general purpose or dedicated computer system having processor, memory and display components, which may be communicatively coupled on an internal or external network (e.g., the World Wide Web/Internet) to other computing systems including databases.

From the above description and drawings, it will be understood by those of ordinary skill in the art that the particular embodiments shown and described are for purposes of illustration only and are not intended to limit the scope of the present invention. Those of ordinary skill in the art will recognize that the present invention may be embodied in other specific forms without departing from its spirit or essential characteristics. References to details of particular embodiments are not intended to limit the scope of the invention.

What is claimed is:
 1. A computer-implemented method for processing documents, comprising: generating, by a processor, a word universe comprising statistics of text words in a plurality of documents; deconstructing each document in the plurality of documents into one or more text blocks; grouping the text blocks within each document and across the plurality of documents into one or more text block groups based on word characteristics of each text block and match characteristics between the text blocks compared to the word universe; and generating a data profile for a respective text block group, the data profile comprising word characteristics and match characteristics of the text blocks in the respective text block group.
 2. The method of claim 1, wherein the statistics of text words in the word universe comprises word frequencies, word weights, and statistical information including minimum, maximum, mean, median and deviation values.
 3. The method of claim 1, wherein deconstructing each document comprises: identifying one or more text block demarcations in the document, the demarcations including carriage returns, line feeds, paragraph breaks, headings, captions, prefixes and postfixes; and dividing the document into one or more text blocks based on the identified text block demarcations.
 4. The method of claim 1, wherein the word characteristics of each text block comprise a base score calculated based on a weighted sum of word frequency of each word in the text block compared to the word universe.
 5. The method of claim 1, wherein the match characteristics of the text blocks comprise a match score calculated based on a weighted sum of matching words between and among text blocks compared to the word universe.
 6. The method of claim 1, further comprising: merging one or more text block groups with matching data profiles into a text block group set; and generating a data profile for the text block group set based on the data profiles of the one or more text block groups.
 7. The method of claim 6, further comprising: reiterating the document deconstructing, text block grouping, and data profile generating based on the generated data profiles for the one or more text block groups and text block group sets.
 8. The method of claim 1, further comprising: receiving a source document for analysis; deconstructing the source document into one or more text blocks; comparing word characteristics and match characteristics of the one or more text blocks associated with the source document with the generated data profiles for the text block groups to identify matching text block groups; and determining similarity and divergence between the one or more text blocks associated with the source document and the identified matching text block groups.
 9. The method of claim 1, further comprising: identifying a set of default clauses from text blocks in a text block group; identifying one or more alternative clauses for each default clause; identifying an outline from the text blocks in the text block group; and generating a template based on the identified outline, default clauses and alternative clauses for the text block group.
 10. The method of claim 1, further comprising: receiving a search query, the query comprising text terms, clauses and/or text blocks; comparing word characteristics of the search query with the generated data profiles for the text block groups to identify matching text block groups; ranking text blocks in the matching text block groups based on match characteristics between the search query and the text blocks; and presenting the ranked text blocks as a search result.
 11. A non-transitory computer-readable storage medium storing executable computer program instructions for processing documents, the computer program instructions comprising instructions for: generating, by a processor, a word universe comprising statistics of text words in a plurality of documents; deconstructing each document in the plurality of documents into one or more text blocks; grouping the text blocks within each document and across the plurality of documents into one or more text block groups based on word characteristics of each text block and match characteristics between the text blocks compared to the word universe; and generating a data profile for a respective text block group, the data profile comprising word characteristics and match characteristics of the text blocks in the respective text block group.
 12. The non-transitory computer-readable storage medium of claim 11, wherein the statistics of text words in the word universe comprises word frequencies, word weights, and statistical information including minimum, maximum, mean, median and deviation values.
 13. The non-transitory computer-readable storage medium of claim 11, wherein the computer program instructions for deconstructing each document further comprises instructions for: identifying one or more text block demarcations in the document, the demarcations including carriage returns, line feeds, paragraph breaks, headings, captions, prefixes and postfixes; and dividing the document into one or more text blocks based on the identified text block demarcations.
 14. The non-transitory computer-readable storage medium of claim 11, wherein the word characteristics of each text block comprise a base score calculated based on a weighted sum of word frequency of each word in the text block compared to the word universe.
 15. The non-transitory computer-readable storage medium of claim 11, wherein the match characteristics of the text blocks comprise a match score calculated based on a weighted sum of matching words between and among text blocks compared to the word universe.
 16. The non-transitory computer-readable storage medium of claim 11, wherein the computer program instructions further comprise instructions for: merging one or more text block groups with matching data profiles into a text block group set; and generating a data profile for the text block group set based on the data profiles of the one or more text block groups.
 17. The non-transitory computer-readable storage medium of claim 16, wherein the computer program instructions further comprise instructions for: reiterating the document deconstructing, text block grouping, and data profile generating based on the generated data profiles for the one or more text block groups and text block group sets.
 18. The non-transitory computer-readable storage medium of claim 11, wherein the computer program instructions further comprise instructions for: receiving a source document for analysis; deconstructing the source document into one or more text blocks; comparing word characteristics and match characteristics of the one or more text blocks associated with the source document with the generated data profiles for the text block groups to identify matching text block groups; and determining similarity and divergence between the one or more text blocks associated with the source document and the identified matching text block groups.
 19. The non-transitory computer-readable storage medium of claim 11, wherein the computer program instructions further comprise instructions for: identifying a set of default clauses from text blocks in a text block group; identifying one or more alternative clauses for each default clause; identifying an outline from the text blocks in the text block group; and generating a template based on the identified outline, default clauses and alternative clauses for the text block group.
 20. The non-transitory computer-readable storage medium of claim 11, wherein the computer program instructions further comprise instructions for: receiving a search query, the query comprising text terms, clauses and/or text blocks; comparing word characteristics of the search query with the generated data profiles for the text block groups to identify matching text block groups; ranking text blocks in the matching text block groups based on match characteristics between the search query and the text blocks; and presenting the ranked text blocks as a search result.
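
By way of non-limiting illustration only, the word universe, text block deconstruction, base score and match score recited in claims 11 to 15 may be sketched as follows. The function names, the regular-expression tokenizer, the logarithmic word weight and the sample documents are all assumptions made for this sketch; the claims do not specify a particular weighting formula, and this is not presented as the claimed implementation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and return its alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def deconstruct(document):
    """Split a document into text blocks at carriage returns and line feeds."""
    return [b.strip() for b in re.split(r"[\r\n]+", document) if b.strip()]


def build_word_universe(documents):
    """Count every word across all documents and derive a weight per word.

    Assumption: weight = log(1 + total / count), so rarer words weigh more.
    """
    counts = Counter(w for doc in documents for w in tokenize(doc))
    total = sum(counts.values())
    weights = {w: math.log(1 + total / c) for w, c in counts.items()}
    return counts, weights


def base_score(block, weights):
    """Weighted sum of the frequency of each word in one text block."""
    freq = Counter(tokenize(block))
    return sum(n * weights.get(w, 0.0) for w, n in freq.items())


def match_score(block_a, block_b, weights):
    """Weighted sum over the words that two text blocks share."""
    shared = set(tokenize(block_a)) & set(tokenize(block_b))
    return sum(weights.get(w, 0.0) for w in shared)


# Hypothetical sample documents, not taken from the specification.
docs = [
    "Governing Law\nThis agreement is governed by the laws of New York.",
    "Governing Law\nThis agreement is governed by the laws of Delaware.\n"
    "Notices\nAll notices must be in writing.",
]
counts, weights = build_word_universe(docs)
blocks = [b for d in docs for b in deconstruct(d)]
```

In this sketch the two governing-law sentences share most of their words and so receive a higher match score than a governing-law sentence paired with the notices sentence, which is the kind of signal a grouping step could use to place text blocks into text block groups.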